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Abstract. We reconsider randomized algorithms for the low-rank approximation of symmetric positive 
semi-definite (SPSD) matrices such as Laplacian and kernel matrices that arise in data analysis and ma- 
chine learning applications. Our main results consist of an empirical evaluation of the performance quality 
and running time of sampling and projection methods on a diverse suite of SPSD matrices. Our results 
highlight complementary aspects of sampling versus projection methods; they characterize the effects of 
common data preprocessing steps on the performance of these algorithms; and they point to important 
differences between uniform sampling and nonuniform sampling methods based on leverage scores. In 
addition, our empirical results illustrate that existing theory is so weak that it does not provide even a qual- 
itative guide to practice. Thus, we complement our empirical results with a suite of worst-case theoretical 
bounds for both random sampling and random projection methods. These bounds are qualitatively supe- 
rior to existing bounds — e.g., improved additive-error bounds for spectral and Frobenius norm error and 
relative-error bounds for trace norm error — and they point to future directions to make these algorithms 
useful in even larger-scale machine learning applications. 



1. Introduction 

We reconsider randomized algorithms for the low-rank approximation of symmetric positive semi- 
definite (SPSD) matrices such as Laplacian and kernel matrices that arise in data analysis and machine 
learning applications. Our goal is to obtain an improved understanding, both empirically and theo- 
retically, of the complementary strengths of sampling versus projection methods on realistic data. Our 
main results consist of an empirical evaluation of the performance quality and running time of sampling 
and projection methods on a diverse suite of dense and sparse SPSD matrices drawn both from machine 
learning as well as more general data analysis applications. These results are not intended to be com- 
prehensive but instead to be illustrative of how randomized algorithms for the low-rank approximation 
of SPSD matrices behave in a broad range of realistic machine learning and data analysis applications. 

In addition to being of interest in their own right, our empirical results point to several directions 
that are not explained well by existing theory. (For example, that the results are much better than 
existing worst-case theory would suggest, and that sampling with respect to the statistical leverage 
scores leads to results that are complementary to those achieved by projection-based methods.) Thus, 
we complement our empirical results with a suite of worst-case theoretical bounds for both random 
sampling and random projection methods. These bounds are qualitatively superior to existing bounds — 
e.g., improved additive-error bounds for spectral and Frobenius norm error and relative-error bounds 
for trace norm error. Importantly, by considering random sampling and random projection algorithms 
on an equal footing, we identify within our analysis deterministic structural properties of the input data 
and sampling/projection methods that are responsible for high-quality low-rank approximation. 

In more detail, our main contributions are fourfold. 

• First, we provide an empirical illustration of the complementary strengths and weaknesses of 
data-independent random projection methods and data-dependent random sampling methods 
when applied to SPSD matrices. We do so for a diverse class of SPSD matrices drawn from 
machine learning and more general data analysis applications, and we consider reconstruction 
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error with respect to the spectral, Frobenius, as well as trace norms. Depending on the pa- 
rameter settings, the matrix norm of interest, the data set under consideration, etc., one or the 
other method might be preferable. In addition, we illustrate how these empirical properties 
can often be understood in terms of the structural nonuniformities of the input data that are of 
independent interest. 

Second, we consider the running time of high-quality sampling and projection algorithms. For 
random sampling algorithms, the computational bottleneck is typically the exact or approxi- 
mate computation of the importance sampling distribution with respect to which one samples; 
and for random projection methods, the computational bottleneck is often the implementation 
of the random projection. By exploiting and extending recent work on "fast" random projec- 
tions and related recent work on "fast" approximation of the statistical leverage scores, we 
illustrate that high-quality leverage-based random sampling and high-quality random projec- 
tion algorithms have comparable running times. Although both are slower than simple (and in 
general much lower-quality) uniform sampling, both can be implemented more quickly than a 
naive computation of an orthogonal basis for the top part of the spectrum. 
Third, our main technical contribution is a set of deterministic structural results that hold for 
any "sketching matrix" applied to an SPSD matrix. (A precise statement of these results is 
given in Theorems [I] [2] and [3] in Section 4.1 ) We call these "deterministic structural results" 



since there is no randomness involved in their statement or analysis and since they depend on 
structural properties of the input data matrix and the way the sketching matrix interacts with 
the input data. In particular, they highlight the importance of the statistical leverage scores 
(and other related structural nonuniformities having to do with the subspace structure of the 
input matrix), which have proven important in other applications of random sampling and 
random projection algorithms. 

Fourth, our main algorithmic contribution is to show that when the low-rank sketching matrix 
represents certain random projection or random sampling operations, then we obtain worst- 
case quality-of-approximation bounds that hold with high probability. (A precise statement 
of these results is given in Lemmas [2] [3] |4j and [5] in Section 4.2 ) These bounds are quali- 



tatively better than existing bounds (when nontrivial prior bounds even exist); they hold for 
reconstruction error of the input data with respect to the spectral norm and trace norm as well 
as the Frobenius norm; and they illustrate how high-quality random sampling algorithms and 
high-quality random projection algorithms can be treated from a unified perspective. 

A novel aspect of our work is that we adopt a unified approach to these low-rank approximation 
questions — unified in the sense that we consider both sampling and projection algorithms on an equal 
footing, and that we illustrate how the structural nonuniformities responsible for high-quality low-rank 
approximation in worst-case analysis also have important empirical consequences in a diverse class 
of SPSD matrices. By identifying deterministic structural conditions responsible for high-quality low- 
rank approximation of SPSD matrices, we highlight complementary aspects of sampling and projection 
methods; and by illustrating the empirical consequences of structural nonuniformities, we provide 
theory that is a much closer guide to practice than has been provided by prior work. More generally, 
we should note that, although it is beyond the scope of this paper, our deterministic structural results 
could be used to check, in an a posteriori manner, the quality of a sketching method for which one 
cannot establish an a priori bound. 

Our analysis is timely for several reasons. First, in spite of the empirical successes of Nystrom-based 
and other randomized low-rank methods, existing theory for the Nystrom method is quite modest. 
For example, existing worst-case bounds such as those of II21II are very weak, especially compared 
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with existing bounds for least-squares regression and general low-rank matrix approximation prob- 
lems 11221 [23l |46H Q Moreover, many other worst-case bounds make very strong assumptions about 
the coherence properties of the input data 11391 128II . Second, there have been conflicting views in 
the literature about the usefulness of uniform sampling versus nonuniform sampling based on the 
empirical statistical leverage scores of the data in realistic data analysis and machine learning appli- 
cations. For example, some work has concluded that the statistical leverage scores of realistic data 
matrices are fairly uniform, meaning that the coherence is small and thus uniform sampling is appro- 
priate [|64[ 13911 ; while other work has demonstrated that leverage scores are often very nonuniform in 
ways that render uniform sampling inappropriate and that can be essential to highlight properties of 
downstream interest [|54[ 14811 . Third, in recent years several high-quality numerical implementations 
of randomized matrix algorithms for least-squares and low-rank approximation problems have been 
developed []3l |50l [6U [55j 14911 . These have been developed from a "scientific computing" perspective, 
where condition numbers, spectral norms, etc. are of greater interest 114-711 , and where relatively strong 
homogeneity assumptions can be made about the input data. In many "data analytics" applications, 
the questions one asks are very different, and the input data are much less well-structured. Thus, we 
expect that some of our results will help guide the development of algorithms and implementations 
that are more appropriate for large-scale analytics applications. 

In the next section, Section [2] we start by presenting some notation, preliminaries, and related prior 
work. Then, in Section[3]we present our main empirical results; and in Section|4]we present our main 
theoretical results. We conclude in Section|5]with a brief discussion of our results in a broader context. 



2. Notation, Preliminaries, and Related Prior Work 

In this section, we introduce the notation used throughout the paper, and we address several prelim- 
inary considerations, including reviewing related prior work. 

2.1. Notation. Let A e R" xn be an arbitrary SPSD matrix with eigenvalue decomposition A = USU T , 
where we partition U and £ as 

(1) U = (U 1 U 2 ) and S = 

Here, U 1 has k columns and spans the top fc-dimensional eigenspace of A, and Y, 1 ^R kxk is full-rankj^] 
We denote the eigenvalues of A with A X (A) > . . . > A n (A). 

Given A and a rank parameter k, the statistical leverage scores of A relative to the best rank-k approxi- 
mation to A equal the squared Euclidean norms of the rows of the n x k matrix U 1 : 

(2) e j = \\(Vi) j \\ 2 . 

The leverage scores provide a more refined notion of the structural nonuniformities of A than does the 
notion of coherence, \i = | max^^ n j which equals (up to scale) the largest leverage score; and they 
have been used historically in regression diagnostics to identify particularly influential or outlying data 
points. Less obviously, the statistical leverage scores play a crucial role in recent work on randomized 
matrix algorithms: they define the key structural nonuniformity that must be dealt with in order to 
obtain high-quality low-rank and least-squares approximation of general matrices via random sampling 
and random projection methods [|46|| . Although Equation ^ defines them with respect to a particular 
basis, the statistical leverage scores equal the diagonal elements of the projection matrix onto the span 
of that basis, and thus they can be computed from any basis spanning the same space. Moreover, they 



I This statement may at first surprise the reader, since an SPSD matrix is an example of a general matrix, and one might 
suppose that the existing theory for general matrices could be applied to SPSD matrices. While this is true, these existing 
methods for general matrices do not in general respect the symmetry or positive semi-definiteness of the input. 
Variants of our results hold trivially if the rank of A is k or less, and so we focus on this more general case here. 
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can be approximated more quickly than the time required to compute that basis with a truncated SVD 
or a QR decomposition 1 12 Oil . 

We denote by S an arbitrary n x I "sketching" matrix that, when post-multiplying a matrix A, maps 
points from R" to . We are most interested in the case where S is a random matrix that represents 
a random sampling process or a random projection process, but we do not impose this as a restriction 
unless explicitly stated. In order to provide high-quality low-rank matrix approximations, we control 
the error of our approximation in terms of the interaction of the sketching matrix S with the eigenspaces 
of A, and thus we let 



(3) n 1 = U{S andn 2 = U 2 S 

denote the projection of S onto the top and bottom eigenspaces of A, respectively. 

Recall that, by keeping just the top k singular vectors, the matrix A k := UiSjUJ is the best rank- 
k approximation to A, when measured with respect to any unitarily- invariant matrix norm, e.g., the 
spectral, Frobenius, or trace norm. For a vector x e M", let ||x||^, for E, = 1,2, oo, denote the 1- 
norm, the Euclidean norm, and the oo-norm, respectively, and let Diag(A) denote the vector consisting 
of the diagonal entries of the matrix A. Then, ||A|| 2 = ||Diag(S)|| 00 denotes the spectral norm of A; 
||A|| F = ||Diag(£)|| 2 denotes the Frobenius norm of A; and ||A|L = ||Diag(S)|| 1 denotes the trace norm 
(or nuclear norm) of A. Clearly, 

l|A|| 2 < ||A|| F < ||A|L < v^||A|| F <n||A|| 2 . 

We quantify the quality of our algorithms by the "additional error" (above and beyond that incurred 
by the best rank-fc approximation to A). In the theory of algorithms, bounds of the form provided by 



(17) and (18) below are known as additive-error bounds, the reason being that the additional error is 
an additive factor of the form e times a size scale that is larger than the "base error" incurred by the 
best rank-fc approximation. In this case, the goal is to minimize the "size scale" of the additional error. 
Bounds of this form are very different and in general weaker than when the additional error enters as 
a multiplicative factor, such as when the error bounds are of the form ||A — A|| < f{n, k, r?)||A — A k \\, 
where /(•) is some function and 17 represents other parameters of the problem. These latter bounds 



are of greatest interest when / = 1 + e, for an error parameter e, as in (19) and (20) below. These 
relative-error bounds, in which the size scale of the additional error equals that of the base error, provide 
a much stronger notion of approximation than additive-error bounds. 



2.2. Preliminaries. In many machine learning and data analysis applications, one is interested in sym- 
metric positive semi-definite (SPSD) matrices, e.g., kernel matrices and Laplacian matrices. One com- 
mon column-sampling-based approach to low-rank approximation of SPSD matrices is the so-called 
Nystrom method [|641 12T1 13911 . The Nystrom method — both randomized and deterministic variants — 
has proven useful in applications where the kernel matrices are reasonably well-approximated by low- 
rank matrices; and it has been applied to Gaussian process regression, spectral clustering and image 
segmentation, manifold learning, and a range of other common machine learning tasks 11641 l63l l25l 
l58l l68l 13911 . The simplest Nystrom-based procedure selects columns from the original data set uni- 
formly at random and then uses those columns to construct a low-rank SPSD approximation. Although 
this procedure can be effective in practice for certain input matrices, two extensions (both of which 
are more expensive) can substantially improve the performance, e.g., lead to lower reconstruction er- 
ror for a fixed number of column samples, both in theory and in practice. The first extension is to 
sample columns with a judiciously-chosen nonuniform importance sampling distribution; and the sec- 
ond extension is to randomly mix (or combine linearly) columns before sampling them. For the random 
sampling algorithms, an important question is what importance sampling distribution should be used to 
construct the sample; while for the random projection algorithms, an important question is how to im- 
plement the random projections. In either case, appropriate consideration should be paid to questions 
such as whether the data are sparse or dense, how the eigenvalue spectrum decays, the nonuniformity 
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properties of eigenvectors, e.g., as quantified by the statistical leverage scores, whether one is interested 
in reconstructing the matrix or performing a downstream machine learning task, and so on. 
The following sketching model subsumes both of these classes of methods. 

• SPSD Sketching Model. Let A be an n x n positive semi-definite matrix, and let S be a matrix of 
size n x I, where I <C n. Take 

C = AS and W = S r AS. 

Then CW^C r is a low-rank approximation to A with rank at most I. 

We should note that the SPSD Sketching Model, formulated in this way, is not guaranteed to be nu- 
merically stable: if W is ill-conditioned, then instabilities may arise in forming the product CW'C T . 
Thus, we are also interested in CW[C r , where W fc is the best rank-fc approximation to W, and where 
k is a rank parameter. For example, one might specify k and then "oversample" by choosing £ > k but 
still be interested in an approximation that has rank no greater than k. Often, "filtering" a low-rank 
approximation in this way through a (lower) rank-fc space has a regularization effect: for example, 
relative-error CUR matrix decompositions are implicitly regularized by letting the "middle matrix" have 
rank no greater than k | |22l 14811 ; and II15II considers a regularization of the uniform column sampling 
Nystrom extension where, before forming the extension, all singular values of W smaller than a thresh- 
old are truncated to zero. For our empirical evaluation, we consider both cases, which we refer to 
as "non-rank-restricted" and "rank-restricted," respectively. For our theoretical results, for simplicity 
of notation, we do not describe the generalization of our results to this rank-restricted model; but we 
note that our analysis could be extended to include this, e.g., by letting the sketching matrix S be a 
combination of a sampling operation and an operation that projects to the best rank-fc approximation. 

The choice of distribution for the sketching matrix S leads to different classes of low-rank approxi- 
mations. For example, if S represents the process of column sampling, either uniformly or according 
to a nonuniform importance sampling distribution, then we refer to the resulting approximation as a 
Nystrom extension; if S consists of random linear combinations of most or all of the columns of A, 
then we refer to the resulting approximation as a projection-based SPSD approximation. In this paper, 
we focus on Nystrom extensions and projection-based SPSD approximations that fit the above SPSD 
Sketching Model. In particular, we do not consider adaptive schemes, which iteratively select columns 
to progressively decrease the approximation error. While these methods often perform well in prac- 
tice 11101 [9l [24l 13911 , rigorous analyses of them are hard to come by — interested readers are referred to 
the discussion in []24l |39j . 

2.3. The Power Method. One can obtain the optimal rank-fc approximation to A by forming an SPSD 
sketch where the sketching matrix S is an orthonormal basis for the range of A k , because with such 
a choice, 

CW ( C T = AS(S r AS) t S r A = A(SS r ASS T ) t A = A(P Ajc AP A J r A = AA*A = A fe . 

Of course, one cannot quickly obtain such a basis; this motivates considering sketching matrices S q 
obtained using the power method: that is, taking S q = A 9 S where q is a positive integer and S e 
R nxt with I > k. As q — * oo, assuming U^S has full row-rank, the matrices S q increasingly capture 
the dominant fc-dimensional eigenspaces of A [29, Chapter 8], so one can reasonably expect that the 
sketching matrix S q produces SPSD sketches of A with lower additional error. 

SPSD sketches produced using q iterations of the power method have lower error than sketches 
produced without using the power method, but are roughly q times more costly to produce. Thus, 
the power method is most applicable when A is such that one can compute the product A 9 S fast. We 
consider the empirical performance of sketches produced using the power method in Section [3] and we 
consider the theoretical performance in Section [4} 
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2.4. Related Prior Work. Motivated by large-scale data analysis and machine learning applications, 
recent theoretical and empirical work has focused on "sketching" methods such as random sampling 
and random projection algorithms. A large part of the recent body of this work on randomized matrix 
algorithms has been summarized in the recent monograph of Mahoney [|46|| and the recent review 
article of Halko, Martinsson, and Tropp [32] . Here, we note that, on the empirical side, both ran- 
dom projection methods (e.g., 11121 l26l I6TH and QH) and random sampling methods (e.g., 11541 14810 
have been used in applications for clustering and classification of general data matrices; and that some 
of this work has highlighted the importance of the statistical leverage scores that we use in this pa- 
per 11541 l48l l46l 1661 . In parallel, so-called Nystrom-based methods have also been used in machine 
learning applications. Originally used by Williams and Seeger to solve regression and classification 
problems involving Gaussian processes when the SPSD matrix A is well-approximated by a low-rank 
matrix [I64II63II . the Nystrom extension has been used in a large body of subsequent work. For example, 
applications of the Nystrom method to large-scale machine learning problems include 11581 [37l l38l 14411 
and []69l|4U|68]], and applications in statistics and signal processing include U53l [71 fTTl [571 l8l ITUl l9l . 

Much of this work has focused on new proposals for selecting columns (e.g., 11691 [67l l42l LB 141 II ) 
and/or coupling the method with downstream applications (e.g., U5l 1171 l34l 1331 1431 |4l ). The most de- 
tailed results are provided by [3911 (as well as the conference papers on which it is based 11371 1361 1381). 
Interestingly, they observe that uniform sampling performs quite well, suggesting that in the data they 
considered the leverage scores are quite uniform, which also motivated the related work [|59l 15111 . This 
is in contrast with applications in genetics H54IL term-document analysis ||48|| , and astronomy [66], 
where the statistical leverage scores were seen to be very nonuniform in ways of interest to the down- 
stream scientist; we return to this issue in Section [3j 

On the theoretical side, much of the work has followed that of Drineas and Mahoney 112111 , who 
provided the first rigorous bounds for the Nystrom extension of a general SPSD matrix. They show that 
when Q(fce~ 4 ln5 _1 ) columns are sampled with an importance sampling distribution that is propor- 
tional to the square of the diagonal entries of A, then 

(4) \\A-CW*C\ < ||A-A fc || f + e^ =i (A)f ; 

holds with probability 1 — 5, where £ = 2, F represents the Frobenius or spectral norm. (Actually, they 
prove a stronger result of the form given in Equation fl4D, except with W^ replaced with W^, where 
W fc represents the best rank-fc approximation to W ||21||7) Subsequently, Kumar, Mohri, and Talwalkar 
show that if pifc ln(fc/5)) columns are sampled uniformly at random with replacement from an A that has 
exactly rank k, then one achieves exact recovery, i.e., A = CW^C r , with high probability [37]. Gittens 
extends this to the case where A is only approximately low-rank II28II . In particular, he shows that if 
I = fl(jLtfclnfc) columns are sampled uniformly at random (either with or without replacement), then 

n + rii n n ( 2n 

(5) ||A-CWC r || 2 < ||A-A fc ||J 1 + — 

with probability exceeding 1 — 5 and 

(6) ||A-CW t C T || 2 < ||A-A fc || 2 + ~||A-A fc |L 

with probability exceeding 1 — 25. 

We have described these prior theoretical bounds in detail to emphasize how strong, relative to the 
prior work, our new bounds are. For example, Equation Q provides an additive-error approximation 
with a very large scale; the bounds of Kumar, Mohri, and Talwalkar require a sampling complexity 
that depends on the coherence of the input matrix H37II , which means that unless the coherence is 
very low one needs to sample essentially all the rows and columns in order to reconstruct the matrix; 
Equation ^ provides a bound where the additive scale depends on n; and Equation ([6]) provides a 
spectral norm bound where the scale of the additional error is the (much larger) trace norm. Table [l] 
compares the bounds on the approximation errors of SPSD sketches derived in this work to those 
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available in the literature. We note further that Wang and Zhang recently established lower-bounds 
on the worst-case relative spectral and trace norm errors of uniform Nystrom extensions [16211 . Our 
Lemma [5] provides matching upper bounds, showing the optimality of these estimates. 

A related stream of research concerns projection-based low-rank approximations of general (i.e., 
non-SPSD) matrices [32, 46]. Such approximations are formed by first constructing an approximate 
basis for the top left invariant subspace of A, and then restricting A to this space. Algorithmically 
one constructs Y = AS, where S is a sketching matrix, then takes Q to be a basis obtained from the QR 
decomposition of Y, and then forms the low-rank approximation QQ T A. The survey paper [32] proposes 
two schemes for the approximation of SPSD matrices that fit within this paradigm: Q(Q r AQ)Q r and 
(AQ)(Q r AQ)' (Q T A). The first scheme — for which [32] provides quite sharp error bounds when S is a 
matrix of i.i.d. standard Gaussian random variables — has the salutary property of being numerically 
stable. On the other hand, although II32II does not provide any theoretical guarantees for the second 
scheme, it points out that this latter scheme produces noticeably more accurate approximations in 
practice. In Section[3j we provide empirical evidence of the superior performance of the second scheme, 



and we show that it is actually an instantiation of the power method (as described in Section 2.3 ) with 
q = 2. Accordingly, the deterministic and stochastic error bounds provided in Section [4] are applicable 
to this SPSD sketch. 

It is worth noting that in [16211 , the authors propose a modified Nystrom method wherein the matrix 
W is replaced by C 1 'A(C t ) r , so that the low rank approximation to A is given by CC T A(C t ) r C r . Note that 
CC^ is another expression for the orthoprojector QQ r onto the range of Y = AS, so this Nystrom method 
is an instantiation of the projection-based low-rank approximations analyzed in [32]. However, [62], 
unlike II32II , considers the case where C is constructed by sampling from the columns of A adaptively. 
The low-rank approximation produced by the algorithm proposed in [16211 satisfies 

E ||A- CC i 'A(C t ) r C r || F < (1 + e) ||A- A fc || F 

when 0(fc/e 2 ) columns are sampled. 



2.5. An overview of our bounds. Our bounds in Table [T] (established as Lemmas [2}|5] in Section [4^2] ) 



exhibit a common structure: for the spectral and Frobenius norms, we see that the additional error 
is on a larger scale than the optimal error, and the trace norm bounds all guarantee relative error 
approximations. This follows from the fact, as detailed in Section |4.1| that low-rank approximations 
that conform to the SPSD sketching model can be understood as forming column-sample/projection- 
based approximations to the square root of A, and thus squaring this approximation yields the resulting 
approximation to A. The squaring process unavoidably results in potentially large additional errors in 
the case of the spectral and Frobenius norms — whether or not the additional errors are large in practice 
depends upon the properties of the matrix and the form of stochasticity used in the sampling process. 
For instance, from our bounds it is clear that Gaussian-based SPSD sketches are expected to have lower 
additional error in the spectral norm than any of the other sketches considered. 

From Table [I] we also see, in the case of uniform Nystrom extensions, a necessary dependence on 
the coherence of the input matrix since columns are sampled uniformly at random. However, we also 
see that the scales of the additional error of the Frobenius and trace norm bounds are substantially 
improved over those in prior results. The large additional error in the spectral norm error bound is 
necessary in the worse case [28]. Lemmas [2j [3] and [4] in Section 4.2 — which respectively address 



leverage-based, Fourier-based, and Gaussian-based SPSD sketches — show that spectral norm additive- 
error bounds with additional error on a substantially smaller scale can be obtained if one first mixes 
the columns before sampling from A or one samples from a judicious nonuniform distribution over 
the columns. 

Table [2] compares the minimum, mean, and maximum approximation errors of several SPSD sketches 



of four matrices (described in Section 3.1) to the optimal rank-fc approximation errors. We consider 



three regimes for I, the number of column samples used to construct the sketch: I = O(fc), I = O(fclnfc), 
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Source 


I 


||A-CW T C J || 2 


l|A-CW T C'j| F 


||A-CW T C'|L 


Prior works 


fl2TH 


£Xe _4 fc) 




opt F + eX" = i4 




[10] 


0(1) 






o(^) hail 


[59] 


f2(/i,.r lnr) 











[39] 


n(i) 


opt 2 + ^ ||A|| 2 


opt F + n(f) 1 / 4 ||A|| 2 




This work 


Lemma [5] uni- 
form column 
sampling 


„ ( n k klnk \ 
"l(l-e) 2 J 


opt 2 (l + f t ) 


opt F + e _1 opt + 


optjl + e -1 ) 


Lemma [2] 
leverage-based 
column sampling 




opt 2 + e 2 opt, 


opt F + eopt^ 


(1 + e 2 )opt„ 


Lemma [3] 

Fourier-based 

projection 


n(e _1 fclnn) 


(l + T^)opt 2+ ^ 


opt F + -v/eopt^ 


(1 + e)opt. 


Lemma |4j 

Gaussian-based 

projection 




(l + e^opt.+ fopt* 


opt F + eopt, 


(1 + e z )opt, 



Table 1. Comparison of our bounds on the approximation errors of several types of 
SPSD sketches with those provided in prior works. Only the asymptotically largest 
terms (as e — * 0) are displayed and constants are omitted, for simplicity. Here, e e 
(0, 1), optj- is the smallest ^-norm error possible when approximating A with a rank-fc 
matrix (fc > Inn), r = rank(A), I is the number of column samples sufficient for the 
stated bounds to hold, k is a target rank, and Lt s is the coherence of A relative to the 
best rank-5 approximation to A. The parameter /3 e (0, 1] allows for the possibility of 



sampling using /3 -approximate leverage scores (see Section 4.2.1 1 rather than the exact 
leverage scores. With the exception of II21II . which samples columns with probability 
proportional to their Euclidean norms, and our novel leverage-based Nystrom bound, 
these bounds are for sampling columns or linear combinations of columns uniformly at 
random. All bounds hold with constant probability. 



and I = O(fclnn). These matrices exhibit a diverse range of properties: e.g., Enron is sparse and has a 
slowly decaying spectrum, while Protein is dense and has a rapidly decaying spectrum. Yet we notice 
that the sketches perform quite well on each of these matrices. In particular, when I = O(fclnn), the 
average errors of the sketches are within 1 + e of the optimal rank-fc approximation errors, where 
e e [0, 1]. Also note that the leverage-based sketches consistently have lower average errors (in all of 
the three norms considered) than all other sketches. Likewise, the uniform Nystrom extensions usually 
have larger average errors than the other sketches. These two sketches represent opposite extremes: 
uniform Nystrom extensions (constructed using uniform column sampling) are constructed using no 
knowledge about the matrix, while leverage-based sketches use an importance sampling distribution 
derived from the SVD of the matrix to determine which columns to use in the construction of the sketch. 

Table [3] illustrates the gap between the theoretical results currently available in the literature and 
what is observed in practice: it depicts the ratio between the error bounds in Table [l] and the average 
errors observed over 30 runs of the SPSD approximation algorithms (the error bound from H59II is not 
considered in the table, as it does not apply at the number of samples I used in the experiments). 



Enron, k = 60 



||A-CV\rc 7 '|| 2 /||A-A t 



Nystrom 
SRFT sketch 
Gaussian sketch 
Leverage sketch 



1.386/1 
1.378/1 
1.378/1 
1.321/1 



fc + 8 

.386/1.386 
.379/1.381 
.380/1.381 
.381/1.386 



I = 
1.386/1 
1.357/1 
1.357/1 
1.039/1 



fclnfe 

.386/1.386 
.360/1.364 
.360/1.364 
.188/1.386 



I = fclnn 
1.386/1.386/1.386 
1.310/1.317/1.323 
1.314/1.318/1.323 
1.039/1.042/1.113 



||A-CW T C'|| F /I|A-A (( 



Nystrom 
SRFT sketch 
Gaussian sketch 
Leverage sketch 



1.004/1 
1.004/1 
1.004/1 
1.002/1 



fc + 8 

.004/1.004 
.004/1.004 
.004/1.004 
.002/1.003 



t = 
0.993/0 
0.994/0 
0.994/0 
0.994/0 



fclnk 

994/0.994 
994/0.994 
994/0.994 
995/0.996 



I = fclnn 
0.972/0.972/0.973 
0.972/0.972/0.972 
0.972/0.972/0.972 
0.988/0.989/0.989 



||A-CW T C'||,/I|A-A t 



Nystrom 
SRFT sketch 
Gaussian sketch 
Leverage sketch 



1.002/1 
1.002/1 
1.002/1 
1.002/1 



fc + 8 

.002/1.003 
.002/1.002 
.002/1.002 
.002/1.003 



1 = 
0.984/0 
0.984/0 
0.984/0 
0.990/0 



fclnfc 

984/0.984 
984/0.984 
984/0.984 
991/0.992 



I = fclnn 
0.943/0.944/0.944 
0.944/0.944/0.944 
0.944/0.944/0.944 
0.977/0.978/0.980 





AbaloneD, a 


= .15, fc = 20 






||A-CW T C'' 


L/liA-yyia 






t = k + 8 


1 = fclnfc 


1 = fclnn 


Nystrom 


2.168/2.455/2.569 


2.022/2.381/2.569 


1.823/2.204/2.567 


SRFT sketch 


2.329/2.416/2.489 


2.146/2.249/2.338 


1.741/1.840/1.918 


Gaussian sketch 


2.347/2.409/2.484 


2.161/2.254/2.361 


1.723/1.822/1.951 


Leverage sketch 


1.508/1.859/2.377 


1.152/1.417/2.036 


0.774/0.908/1.091 




||A- CVfC 


II f /IIa-a ([ || f 






t=k + 8 


^ = fclnfc 


1 = fclnn 


Nystrom 


1.078/1.090/1.098 


1.061/1.078/1.091 


1.026/1.040/1.054 


SRFT sketch 


1.088/1.089/1.090 


1.074/1.075/1.077 


1.034/1.035/1.037 


Gaussian sketch 


1.087/1.089/1.091 


1.073/1.075/1.077 


1.033/1.035/1.036 


Leverage sketch 


1.028/1.040/1.059 


0.998/1.006/1.020 


0.959/0.963/0.968 




UA-CWC 1 


ll./IIA-AJL 






t = k + 8 


f = fclnfc 


1 = fclnn 


Nystrom 


1.022/1.024/1.026 


1.010/1.014/1.016 


0.977/0.980/0.983 


SRFT sketch 


1.024/1.024/1.024 


1.014/1.014/1.014 


0.980/0.980/0.981 


Gaussian sketch 


1.024/1.024/1.024 


1.014/1.014/1.014 


0.980/0.980/0.981 


Leverage sketch 


1.009/1.012/1.016 


0.994/0.997/1.000 


0.965/0.968/0.971 



Protein, fc = 10 



||A-CV\rc 7 '|| 2 /||A-A t 



Nystrom 
SRFT sketch 
Gaussian sketch 
Leverage sketch 



1.570/2 
1.835/1 
1.812/1 
1.345/1 



fc + 8 

104/2.197 
950/2.039 
956/2.058 
644/2.166 



1.496/2. 
1.686/1 
1.653/1 
1.198/1 



fclnfc 

.100/2.196 
.874/2.009 
894/2.007 
.498/2.160 



I = fclnn 
1.023/1.350/2.050 
1.187/1.287/1.405 
1.187/1.293/1.438 
0.942/0.994/1.073 



||A-CW T C'|| F /l|A-A, t || F 



Nystrom 
SRFT sketch 
Gaussian sketch 
Leverage sketch 



1.041/1 
1.049/1 
1.049/1 
1.027/1 



fc + 8 

.054/1.065 
.054/1.058 
.054/1.060 
.036/1.054 



I- 

1.023/1 
1.032/1 
1.032/1 
1.011/1 



fclnfc 

.042/1.054 
037/1.043 
039/1.043 
.018/1.034 



I = fclnn 
0.867/0.877/0.894 
0.873/0.877/0.880 
0.874/0.878/0.883 
0.862/0.868/0.875 



||A-CW T C'||,/I|A-A, t 



Nystrom 
SRFT sketch 
Gaussian sketch 
Leverage sketch 



1.011/1 
1.013/1 
1.013/1 
1.004/1 



fc + 8 

.014/1.018 
.015/1.016 
.015/1.017 
.008/1.014 



0.988/0. 
0.990/0. 
0.991/0. 
0.982/0. 



fclnfc 

.994/0.998 
993/0.995 
993/0.994 
.985/0.991 



I = fclnn 
0.760/0.764/0.770 
0.762/0.764/0.766 
0.762/0.765/0.767 
0.758/0.765/0.771 



WineS, a = 1, fc = 20 



||A-CW T C T || 2 /||A-A t 



Nystrom 
SRFT sketch 
Gaussian sketch 
Leverage sketch 



I =fc + 8 
1.989/2.001/2.002 
1.910/1.938/1.966 
1.903/1.942/1.966 
1.242/1.762/1.995 



1 = fclnfc 
1.987/1.998/2.002 
1.840/1.873/1.905 
1.839/1.873/1.910 
1.000/1.317/1.987 



t = fclnn 
1.739/1.978/2.002 
1.624/1.669/1.709 
1.619/1.670/1.707 
1.000/1.000/1.005 



||A-CW-'C r || F /||A-A t 



Nystrom 
SRFT sketch 
Gaussian sketch 
Leverage sketch 



I =fc + 8 
1.036/1.040/1.043 
1.038/1.039/1.039 
1.038/1.039/1.039 
1.004/1.011/1.018 



1 = fclnfc 
1.028/1.034/1.038 
1.029/1.030/1.030 
1.029/1.030/1.030 
0.996/1.000/1.005 



t = fclnn 
0.998/1.009/1.018 
1.000/1.000/1.001 
1.000/1.000/1.001 
0.994/0.995/0.997 



IIA-CW+CIL/HA-A* 



Nystrom 
SRFT sketch 
Gaussian sketch 
Leverage sketch 



e = k + 8 
1.013/1.015/1.016 
1.014/1.014/1.015 
1.014/1.014/1.015 
1.002/1.005/1.009 



1 = fclnfc 
1.002/1.005/1.007 
1.004/1.004/1.004 
1.004/1.004/1.004 
0.997/0.999/1.002 



t = fclnn 
0.965/0.970/0.976 
0.970/0.970/0.970 
0.970/0.970/0.970 
0.995/0.996/0.997 



Table 2. The min/mean/max ratios of the errors of several non-rank-restricted SPSD sketches to the optimal rank-fc approxi- 
mation error for several of the matrices considered in Table [4] Here k is the target rank and I is the number of column samples 
used to form the SPSD sketches. The min/mean/max ratios were computed using 30 trials for combination of I and sketch- 
ing method. 
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Several trends can be identified; among them, we note that the bounds provided in this paper for 
Gaussian-based sketches come quite close to capturing the errors seen in practice, and the Frobenius 
and trace norm error guarantees of the leverage-based and Fourier-based sketches tend to more closely 
reflect the empirical behavior than the error guarantees provided in prior work for Nystrom sketches. 
Overall, the trace norm error bounds are quite accurate. On the other hand, prior bounds are sometimes 
more informative in the case of the spectral norm (with the notable exception of the Gaussian sketches). 
Several important points can be gleaned from these observations. First, the accuracy of the Gaussian 
error bounds suggests that the main theoretical contribution of this work, the deterministic structural 
results given as Theorems[I]through[3] captures the underlying behavior of the SPSD sketching process. 
This supports our belief that this work provides a foundation for truly informative error bounds. Given 
that this is the case, it is clear that the analysis of the stochastic elements of the SPSD sketching process 
is much sharper in the Gaussian case than in the leverage-score, Fourier, and uniform Nystrom cases. 
We expect that, at least in the case of leverage and Fourier-based sketches, the stochastic analysis can 
and will be sharpened to produce error guarantees almost as informative as the ones we have provided 
for Gaussian-based sketches. 



3. Empirical Aspects of SPSD Low-rank Approximation 

In this section, we present our main empirical results, which consist of evaluating sampling and 
projection algorithms applied to a diverse set of SPSD matrices. In addition to understanding the 
relative merits, in terms of both running time and solution quality, of different sampling/projection 
schemes, we would like to understand the effects of various data preprocessing decisions. The bulk 
of our empirical evaluation considers two random projection procedures and two random sampling 
procedures for the sketching matrix S: for random projections, we consider using SRFTs (Subsampled 
Randomized Fourier Transforms) as well as uniformly sampling from Gaussian mixtures of the columns; 
and for random sampling, we consider sampling columns uniformly at random as well as sampling 
columns according to a nonuniform importance sampling distribution that depends on the empirical 
statistical leverage scores. In the latter case of leverage score-based sampling, we also consider the use 
of both the (naive and expensive) exact algorithm as well as a (recently-developed fast) approximation 



algorithm. Section 3.1 starts with a brief description of the data sets we consider; Section 3.2 describes 



the details of our SPSD sketching algorithms; and then Section[33]briefly describes the effect of various 



data preprocessing decisions. In Section 3.4 we present our main results on reconstruction quality for 
the random sampling and random projection methods; and, in Section |3.5| we discuss running time 
issues, and we present our main results for running time and reconstruction quality for both exact and 
approximate versions of leverage-based sampling. 

We emphasize that we don't intend these results to be "comprehensive" but instead to be "illustrative" 
case-studies — that are representative of a much wider range of applications than have been considered 
previously. In particular, we would like to illustrate the tradeoffs between these methods in different 
realistic applications in order, e.g., to provide directions for future work. For instance, prima facie, 
algorithms based on leverage-based column sampling might be expected to be more expensive than 
those based on uniform column sampling or random projections, but (based on previous work for 
general matrices [22, 23, 46]) they might also be expected to deliver lower approximation errors. 
Similarly, using approximate leverage scores to construct the importance sampling distribution might 
be expected to perform worse than using exact leverage scores, but this might be acceptable given its 
computational advantages. In addition to clarifying some of these issues, our empirical evaluation also 
illustrates ways in which existing theory is insufficient to explain the success of sampling and projection 
methods. This motivates our improvements to existing theory that we describe in Section |4| 



3 Taking 5 too much smaller results in estimates for the number of samples required that exceed the dimensions of the matrices 
considered. 
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source, 


sketch 


pred./obs. spectral error 


pred./obs. Frobenius error 


pred./obs. trace error 








Enron, k = 


60 




112 111, column sampling with 


3041.0 


66.2 


_ 


probabilities proportional to 








squared diagonal entries 








[10], uniform column sampling 






2.0 


with replacement 








[ 39 ] , uniform column sampling 


331.2 


77.7 


_ 


without replacement 








Lemma 


|2| leverage-based col- 


1287.0 


20.5 


1.2 


umn sampling 








Lemma 


3 


Fourier-based 


102.1 


42.0 


1.6 


Lemma 


4 


Gaussian-based 


20.1 


7.6 


1.4 


Lemma 


5 


uniform column sam- 


9.4 


285.1 


9.5 


pling with replacement 














Protein, k = 


10 




[21], column sampling with 


125.2 


18.6 


_ 


probabilities proportional to 








squared diagonal entries 








[10], uniform column sampling 






3.6 


with replacement 








[ 39 ] , uniform column sampling 


35.1 


20.5 




without replacement 








Lemma 


2 


leverage-based 


42.4 


6.2 


2.0 


Lemma 


3 


Fourier-based 


155.0 


20.4 


3.1 


Lemma 


4 


Gaussian-based 


5.7 


5.6 


2.2 


Lemma 


5 


uniform column sam- 


90.0 


63.4 


14.3 


pling with replacement 








AbaloneD, a = .15, k = 20 


[21], column sampling with 


360.8 


42.5 




probabilities proportional to 








squared diagonal entries 








[10], uniform column sampling 






2.0 


with replacement 








[ 39 ] , uniform column sampling 


62.0 


45.7 




without replacement 








Lemma 


2 


leverage-based 


235.4 


14.1 


1.3 


Lemma 


3 


Fourier-based 


70.1 


36.0 


1.7 


Lemma 


4 


Gaussian-based 


8.7 


8.3 


1.3 


Lemma 


5 


uniform column sam- 


13.2 


166.2 


9.0 


pling with replacement 














WineS, a = 1, 


t = 20 




112 111, column sampling with 


408.4 


41.1 




probabilities proportional to 








squared diagonal entries 








[10], uniform column sampling 






2.1 


with replacement 








[ 39 ] , uniform column sampling 


70.3 


44.3 




without replacement 








Lemma 


2 


leverage-based 


244.6 


12.9 


1.2 


Lemma 


3 


Fourier-based 


94.8 


36.0 


1.7 


Lemma 


4 


Gaussian-based 


11.4 


8.1 


1.4 


Lemma 


5 


uniform column sam- 


13.2 


162.2 


9.1 


pling with replacement 









Table 3. Comparison of the empirically observed approximation errors to the guar- 
antees provided in this and other works, for several datasets. Each approximation was 
formed using I = 6k In k samples. To evaluate the error guarantees, 5 = 1/2 was takeig 
and all constants present in the statements of the bounds were replaced with ones. The 
observed errors were taken to be the average errors over 30 runs of the approximation 



algorithms. The datasets, described in Section 3.1 are representative of several classes 
of matrices prevalent in machine learning applications. 
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Name 


Description 


n 


d 


%nnz 


Laplacian Kernels 


HEP 


arXiv High Energy Physics collaboration graph 


9877 


NA 


0.06 


GR 


arXiv General Relativity collaboration graph 


5242 


NA 


0.12 


Enron 


subgraph of the Enron email graph 


10000 


NA 


0.22 


Gnutella 


Gnutella peer to peer network on Aug. 6, 2002 


8717 


NA 


0.09 


Linear Kernels 


Dexter 


bag of words 


2000 


20000 


83.8 


Protein 


derived feature matrix for S. cerevisiae 


6621 


357 


99.7 


SNPs 


DNA microarray data from cancer patients 


5520 


43 


100 


Gisette 


images of handwritten digits 


6000 


5000 


100 


Dense RBF Kernels 


AbaloneD 


physical measurements of abalones 


4177 


8 


100 


WineD 


chemical measurements of wine 


4898 


12 


100 


Sparse RBF Kernels 


AbaloneS 


physical measurements of abalones 


4177 


8 


82.9/48.1 


WineS 


chemical measurements of wine 


4898 


12 


11.1/88.0 


Table 4. 1 


"he data sets used in our empirical evaluation ([40], 1135 1 




SI, DH, 



H16II . El). Here, n is the number of data points, d is the number of features in the 
input space before kernelization, and %nnz is the percentage of nonzero entries in the 
matrix. For Laplacian "kernels," n is the number of nodes in the graph (and thus there 
is no d since the graph is "given" rather than "constructed"). The %nnz for the Sparse 
RBF Kernels depends on the o parameter; see Table |5j 



With respect to our computational environment, all of our computations were conducted using 64-bit 
MATLAB R2012a under Ubuntu on a 2.6-GHz quad-core Intel i7 machine with 6Gb of RAM. To allow 
for accurate timing comparisons, all computations were carried out in a single thread. When applied 
to an n x n SPSD matrix A, our implementation of the SRFT requires 0(n 2 Inn) operations, as it applies 
MATLAB's f f t to the entire matrix A and then it samples I columns from the resulting matrix. We note 
that the SRFT computation can be made more competitive: a more rigorous implementation of the 
SRFT algorithm could reduce this running time to 0(n 2 ln£); but due to the complexities involved in 
optimizing pruned FFT codes, we did not pursue this avenue. 

3.1. Data Sets. Table |4| provides summary statistics for the data sets used in our empirical evaluation. 
In order to illustrate the complementary strengths and weaknesses of different sampling versus pro- 
jection methods in a wide range of realistic applications, we consider four classes of matrices which 
are commonly encountered in machine learning and data analysis applications: normalized Laplacians 
of very sparse graphs drawn from "informatics graph" applications; dense matrices corresponding to 
Linear Kernels from machine learning applications; dense matrices constructed from a Gaussian Ra- 
dial Basis Function Kernel (RBFK); and sparse RBFK matrices constructed using Gaussian radial basis 
functions, truncated to be nonzero only for nearest neighbors. Although not exhaustive, this collection 
of data sets represents a wide range of data sets with very different (sparsity spectral, leverage score, 
etc.) properties that have been of interest recently not only in machine learning but in data analysis 
more generally. 

To understand better the Laplacian data, recall that, given an undirected graph with weighted adja- 
cency matrix W, its normalized graph Laplacian is 



A = I-D" 1/2 WD" 1/2 , 



REVISITING THE NYSTROM METHOD FOR IMPROVED LARGE-SCALE MACHINE LEARNING 



13 



where D is the diagonal matrix of weighted degrees of the nodes of the graph, i.e., D u = W t j. This 
Laplacian is an SPSD matrix, but note that not all SPSD matrices can be written as the Laplacian of 
a graph. 

The remaining datasets are positive-semidefinite kernel matrices associated with datasets drawn 
from a variety of application areas. Recall that, given given points x 1; ...,x n e Mr and a function 
k : M d x M d — > M, the nx n matrix with elements 

A ij = K(Xi,*j) 

is called the kernel matrix of k with respect to x 1; ...,x n . Appropriate choices of k ensure that A is 
positive semidefinite. When this is the case, the entries A i; - can be interpreted as measuring, in a 
sense determined by the choice of k, the similarity of points i and j. Specifically, if A is SPSD, then k 
determines a so-called feature map <E> h . : M d — » M n such that 

A = ($ K (x ; ),«f,(x ; )) 

measures the similarity (correlation) of x ; and x ; - in feature space [15611 . 
When k is the usual Euclidean inner-product, so that 

A{j = ( x i) x /c)) 

A is called a Linear Kernel matrix. Gaussian RBFK matrices, defined by 



A° = exp 



H X ' X }\\2 



' \ a 2 

correspond to the similarity measure /c(x,y) = exp(— ||x — yll^/c 2 ). Here cr, a nonnegative number, 
defines the scale of the kernel. Informally, a defines the "size scale" over which pairs of points x ; and x ; 
"see" each other. Typically cr is determined by a global cross-validation criterion, as A a is generated for 
some specific machine learning task; and, thus, one may have no a priori knowledge of the behavior of 
the spectrum or leverage scores of A a as cr is varied. Accordingly, we consider Gaussian RBFK matrices 
with different values of cr. 

Finally, given the same data points, x 1; . . . ,x n , one can construct sparse Gaussian RBFK matrices 

+ II Il2 



A (cr,v,C) 



1 _ H X ' X j||2 



■ exp 



|| X i X i||2 

a 2 



where = max{0, x}. When v is larger than (d + l)/2, this kernel matrix is positive semidefi- 

nite [27]. Increasing v shrinks the magnitudes of the off-diagonal entries of the matrix toward zero. 
As the cutoff point C decreases the matrix becomes more sparse; in particular, C — * ensures that 
aOa.c) _^ j Q n Qfjjgj. j ian( j > q _> oo ensures that A^ a ' v ' C ^ approaches the (dense) Gaussian RBFK 
matrix A a . For simplicity, in our empirical evaluations, we fix v = [(d + l)/2] and C = 3a, and we 
vary cr. As with the effect of varying cr, the effect of varying the sparsity parameter C is not obvious a 
priori — C is typically chosen according to a global criterion to ensure good performance at a specific 
machine learning task, without consideration for its effect on the spectrum or leverage scores of A^' V ' C \ 
To illustrate the diverse range of properties exhibited by these four classes of data sets, consider 
Table [5] Several observations are particularly relevant to our discussion below. 

• All of the Laplacian Kernels drawn from informatics graph applications are extremely sparse 
in terms of number of nonzeros, and they all tend to have very slow spectral decay, as illus- 
trated both by the quantity [ ||A||p / HAH? ] (this is the stable rank, which is a numerically stable 
(under) estimate of the rank of A) as well as by the relatively small fraction of the Frobenius 
norm that is captured by the best rank-fc approximation to A. For the Laplacian Kernels we 
considered two values of the rank parameter k that were chosen (somewhat) arbitrarily; many 
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fcth-largest 
leverage score 
scaled by n/k 



Name 



%nnz 



II 



^k 



100 j 



A— At 



l|A||, 



100 J 



l|A||. 



HEP 

HEP 

GR 

GR 

Enron 

Enron 

Gnutella 

Gnutella 



0.06 
0.06 
0.12 
0.12 
0.22 
0.22 
0.09 
0.09 



3078 
3078 
1679 
1679 
2588 
2588 
2757 
2757 



20 
60 
20 
60 
20 
60 
20 
60 



0.998 
0.998 
0.999 
1 

0.997 
0.999 
1 

0.999 



7.8 

13.2 

10.5 

17.9 

7.77 

12.0 

8.1 

13.7 



0.4 

1.1 

0.74 

2.16 

0.352 

0.94 

0.41 

1.20 



128.8 

41.9 

71.6 

25.3 

245.8 

49.6 

166.2 

49.4 



Dexter 
Protein 
SNPs 
Gisette 



83.8 
99.7 
100 
100 



176 
24 
3 
4 



8 

10 

5 

12 



0.963 
0.987 
0.928 
0.90 



14.5 
42.6 
85.5 
90.1 



.934 
7.66 
37.6 
14.6 



16.6 
5.45 
2.64 
2.46 



AbaloneD (dense, a = .15) 
AbaloneD (dense, a = 1) 
WineD (dense, o = 1) 
WineD (dense, a = 2.1) 



100 
100 
100 
100 



41 

4 
31 

3 



20 
20 
20 
20 



0.992 
0.935 
0.99 
0.936 



42.1 
97.8 
43.1 
94.8 



3.21 
59 
3.89 
31.2 



18.11 
2.44 
26.2 
2.29 



AbaloneS (sparse, a = .15) 
AbaloneS (sparse, a = 1) 
WineS (sparse, u = 1) 
WineS (sparse, u = 2.1) 



82.9 
48.1 
11.1 
88.0 



400 
5 

116 

39 



20 
20 
20 
20 



0.989 
0.982 
0.995 
0.992 



15.4 
90.6 
29.5 
41.6 



1.06 
21.8 
2.29 
3.53 



48.4 
3.57 
49.0 
24.1 



Table 5. Summary statistics for the data sets from Table |4j that we used in our empirical evaluation 



of the results we report continue to hold qualitatively if k is chosen to be (say) an order of 
magnitude larger. 

• Both the Linear Kernels and the Dense RBF Kernels are much denser and are much more well- 
approximated by moderately to very low-rank matrices. In addition, both the Linear Kernels 
and the Dense RBF Kernels have statistical leverage scores that are much more uniform — 
there are several ways to illustrate this, none of them perfect, and here, we illustrate this by 
considering the k th largest leverage score, scaled by the factor n/k (if A were exactly rank k, 
this would be the coherence of A). For the Linear Kernels and the Dense RBF Kernels, this 
quantity is typically one to two orders of magnitude smaller than for the Laplacian Kernels. 

• For the Dense RBF Kernels, we consider two values of the cr parameter, again chosen (some- 
what) arbitrarily. For both AbaloneD and WineD, we see that decreasing cr from 1 to 0.15, 
i.e., letting data points "see" fewer nearby points, has two important effects: first, it results in 
matrices that are much less well-approximated by low-rank matrices; and second, it results in 
matrices that have much more heterogeneous leverage scores. For example, for AbaloneD, the 
fraction of the Frobenius norm that is captured decreases from 97.8 to 42.1 and the scaled k th 
largest leverage score increases from 2.44 to 18.11. 

• For the Sparse RBF Kernels, there are a range of sparsities, ranging from above the sparsity 
of the sparsest Linear Kernel, but all are denser than the Laplacian Kernels. Changing the cr 
parameter has the same effect (although it is even more pronounced) for Sparse RBF Kernels as 
it has for Dense RBF Kernels. In addition, "sparsifying" a Dense RBF Kernel also has the effect 
of making the matrix less well approximated by a low-rank matrix and of making the leverage 
scores more nonuniform. For example, for AbaloneD with cr = 1 (respectively, a = 0.15), the 
fraction of the Frobenius norm that is captured decreases from 97.8 (respectively, 42.1) to 90.6 
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(respectively, 15.4), and the scaled k th largest leverage score increases from 2.44 (respectively, 

18.11) to 3.57 (respectively, 48.4). 
As we see below, when we consider the RBF Kernels as the width parameter and sparsity are varied, 
we observe a range of intermediate cases between the extremes of the ("nice") Linear Kernels and the 
(very "non-nice") Laplacian Kernels. 

3.2. SPSD Sketching Algorithms. The sketching matrix S may be selected in a variety of ways. We 
will provide empirical results for two sampling-based SPSD sketches and two projection-based SPSD 
sketches. In the former case, the sketching matrix S contains exactly one nonzero in each column, 
corresponding to a single sample from the columns of A. In the latter case, S is dense, and mixes the 
columns of A before sampling from the resulting matrix. 

In more detail, we consider two types of sampling-based SPSD sketches (i.e. Nystrom extensions): 
those constructed by sampling columns uniformly at random with replacement, and those constructed 
by sampling columns from a distribution based upon the leverage scores of the matrix filtered through 
the optimal rank-fc approximation of the matrix. In the case of column sampling, the sketching matrix 
S is simply the first £ columns of a matrix that was chosen uniformly at random from the set of all 
permutation matrices. 

In the case of leverage-based sampling, S has a more complicated distribution. Recall that the 
leverage scores relative to the best rank-fc approximation to A are the squared Euclidean norms of the 
rows of the n x k matrix V 1 : 

^IIOHU 2 . 

It follows from the orthonormality of Ux that ^Alj/k) = 1, and the leverage scores can thus be 
interpreted as a probability distribution over the columns of A. To construct a sketching matrix corre- 
sponding to sampling from this distribution, we first select the columns to be used by sampling with 
replacement from this distribution. Then, S is constructed as S = RD where R e R nxt is a column 
selection matrix that samples columns of A from the given distribution — i.e., R i; = 1 iff the ith column 
of A is the jth column selected — and D is a diagonal rescaling matrix satisfying D ;; = — }= iff R i; = 1. 



Jipi 

It is often expensive to compute the leverage scores exactly; and so in Section [33} we consider the em- 
pirical performance of sketches based on several different approximation algorithms for the leverage 
scores. The sketching matrices for these approximations take the same form; the only difference is the 
distribution used to select the column samples. 

The two projection-based sketches we consider are based upon Gaussians and the real Fourier trans- 
form. In the former case, S is a matrix of i.i.d. Jf{0, 1) random variables. In the latter case, S is a 
subsampled randomized Fourier transform (SRFT) matrix; that is, S = ^jDFR, where D is a diagonal 
matrix of Rademacher random variables, F is the real Fourier transform matrix, and R restricts to I 
columns 

In the figures, we refer to sketches constructed by selecting columns uniformly at random with the 
label 'unif, leverage score-based sketches with 'lev', Gaussian sketches with 'gaussian', and Fourier 
sketches with 'srft'. 

3.3. Effects of Data Analysis Preprocessing Decisions. Before proceeding with our main empirical 
results, we pause to describe the effects of various machine learning and data analysis "design de- 
cisions" on the behavior of SPSD sketching algorithms in general as well as on the behavior of the 
statistical leverage scores in particular. We should emphasize that, for "worst case" matrices, very little 
can be said in this regard. Thus, these observations are based on our experiences with a diverse range 



of data sets, including those from Section 3.1 While not completely general, these observations are 
likely to hold in modified form for many other realistic data, and they can potentially be useful as 
heuristic guides to practice. For example, if preprocessing does not significantly change the leverage 
score distribution, then one could compute the leverage scores on the raw data and use these to sample 
columns from the processed data or to certify that the data have low coherence. Likewise, the behavior 
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of the leverage scores as the rank parameter k is varied or as the a scale parameter of RBF kernels 
varies is of interest, as it is expensive to compute the leverage scores anew for each value of k or cr as 
part of a cross-validation computation. 

One common preprocessing step is to "whiten" the data before applying a machine learning algo- 
rithm. If the data are given in the form of X e M nxd where the tth row of X is an observation of 
d covariates, then these covariates may have different means and characteristic size scales (i.e., vari- 
ances). In this case, it is often appropriate to transform the covariates so they all have zero mean and 
are on the same size scale. The whitening transform generates a new matrix X, corresponding to these 
transformed covariates, by removing the mean of each column and rescaling the columns so they all 
have unit norm. In our experience, whitening modifies the statistical leverage scores, often by making 
them somewhat more homogeneous, but for a fixed rank parameter k it does not change them too 
substantially, e.g., to within no more than a multiplicative factor of 2. Given the sensitivity of matrix 
reconstruction algorithms to various structural properties of the input data that we describe below, 
however, the more important observation is that whitening tends to decrease the effective rank of the 
input data set, and at the same time it often tends to shrink the spectral gaps. As shown below, this has 
observable consequences on the reconstruction errors of all the sketching methods considered, but in 
particular those involving approximate leverage score computations. 

Another preprocessing decision has to do with the choice of rank k with which to describe the data. 
This is typically determined according to an exogeneously-specified "model selection" criterion that 
does not explicitly take into account the spectrum or leverage score structure of the input matrix. 
It enters our discussion since we consider sampling columns with probabilities proportional to their 
statistical leverage scores relative to a rank-k space, and thus the leverage scores depend on k. In our 
experience, increasing k tends to uniformize or homogeneize the leverage scores, often gradually, but 
sometimes quite substantially. (We should note, however, that there are exceptions to this, where one 
observes very strong localization on low-order eigenvectors of data matrices [18] .) 

Yet another preprocessing decision has to do with the choice of the cr scale parameter in Gaussian 
RBFK matrices. As with the rank parameter, the scale parameter cr in practice is determined according 
to an exogeneously-specified model selection criterion that does not explicitly take into account the 
spectrum or leverage score structure of the input matrix. In our experience, as cr increases, the lever- 
age scores become more and more uniform; and they become more heterogeneous as a decreases. 
Informally, as a data point "sees" more data points, any outlying effect is mitigated. Varying cr also has 
an effect on the spectrum. As a general rule, letting cr — > tends to make the spectrum of A a flatter, 
i.e., decay more slowly, and letting cr — * oo makes A a lower-rank. Recall that the diagonal entries of A a 
are identically one, and as cr — » oo, A a tends to the matrix of all ones. That is, increasing cr corresponds 
to considering all the observations x ; as being equally dissimilar, so all columns are equally noninfor- 
mative. On the other hand, as cr — » 0, A a approaches the identity, and very dissimilar observations 
(in the sense that ||x f — x ; || 2 is large) are penalized more heavily than similar observations, and thus 
there is some nonuniformity in the columns of A CT . In some cases, we observed that, as the scale cr 
decreases, the leverage scores stabilize, identifying the same columns as being important or influential 
over a range of scales. 



3.4. Reconstruction Accuracy of Sampling and Projection Algorithms. Here, we describe the per- 
formances of the SPSD sketches described in Section [372] — column sampling uniformly at random with- 
out replacement, column sampling according to the nonuniform leverage score probabilities, and sam- 
pling using Gaussian and SRFT mixtures of the columns — in terms of reconstruction accuracy for the 



data sets described in Section 3.1 We describe general observations we have made about each class of 
matrices in turn, and then we summarize our observations. We consider only the use of exact leverage 
scores here, and we postpone until Section |3.5| a discussion of running time issues and similar recon- 
struction results when approximate leverage scores are used for the importance sampling distribution. 
In each case, we present results for both the "non-rank-restricted" case as well as the "rank-restricted" 



REVISITING THE NYSTROM METHOD FOR IMPROVED LARGE-SCALE MACHINE LEARNING 



17 



case. Recall that by non-rank-restricted, we mean that the error 
(7) ||A-CW t C r || f /||A-A fc |L 

is plotted; while by rank-restricted, we mean that the error 

(8) 



A - CW] C T 

k 



? /||A-A, 



is plotted (viz., the matrix W in Eqn. ([7]) has been replaced with the low-rank approximation Wj.). Note 
that previous work has shown that relative-error guarantees can be obtained, e.g., with CUR matrix 
decompositions, not only when one projects onto the span of judiciously-chosen columns, analogously 
to Eqn. and as our worst-case guarantees in this paper are formulated, but also when one restricts 
the rank of the low-rank approximation to be no greater than k by projecting onto the best rank- 
k approximation to the original matrix II22II . We evaluate the "rank-restricted" case of the form of 
Eqn. ([8]), that depends on projecting onto the best rank-fc approximation of the subsample (and not 
the original matrix) since it is more algorithmically tractable; but we note that similar but "smoother" 
results (e.g., the error is much more monotonic as a function of the number of samples, when compared 
with the "rank-restricted" results we present below) are obtained empirically with this more expensive 
rank-restriction procedure. The data points plotted in each figure of this section represent the average 
errors observed over 30 trials. 

Finally, we note that previous work has shown that the statistical leverage scores reflect an impor- 
tant nonuniformity structure in the columns of general data matrices 11481 14611 ; that randomly sampling 
columns according to this distribution results in lower worst-case error (for problems such as least- 
squares approximation and low-rank approximation of general matrices) than sampling columns uni- 
formly at random [221 1231 l46l ; and that leverage scores have proven useful in a wide range of practical 
applications [1541 l48l l46l I66H . In spite of this, ours is the first work to implement and evaluate leverage 
score sampling for low-rank approximation of SPSD matrices. 

3.4.1. Graph Laplacians. Figure [I] and Figure [2] show the reconstruction error results for sampling and 
projection methods applied to several normalized graph Laplacians. The former shows GR and HEP 
each for two values of the rank parameter, and the latter shows Enron and Gnutella, again each for two 
values of the rank parameter. Both figures show the spectral, Frobenius, and trace norm approximation 
errors, as a function of the number of column samples I, relative to the error of the optimal rank-fc 
approximation of A. In both figures, the first four (i.e., top) subfigures show the results for the non- 
rank-restricted case, and the last four (i.e., bottom) subfigures show the results for the rank-restricted 
case. In particular, in the rank-restricted case, the low-rank approximation is "filtered" through a rank-fc 
space, and thus the approximation ratio is always greater than unity. 

These and subsequent figures contain a lot of information, some of which is peculiar to the given 
data sets and some of which is more general. In light of subsequent discussion, several observations 
are worth making about the results presented in these two figures. 

• All of the SPSD sketches provide quite accurate approximations — relative to the best possible 
approximation factor for that norm, and relative to bounds provided by existing theory, as 
reviewed in Section |2.4| — even with only k column samples (or in the case of the Gaussian 
and SRFT mixtures, with only k linear combinations of vectors). Upon examination, this is 
partly due to the extreme sparsity and extremely slow spectral decay of these data sets which 
means, as shown in Table [4] that only a small fraction of the (spectral or Frobenius or trace) 
mass is captured by the optimal rank 20 or 60 approximation. Thus, although an SPSD sketch 
constructed from 20 or 60 vectors also only captures a small portion of the mass of the matrix, 
the relative error is small, since the scale of the residual error is large. 

• The scale of the Y axes is different between different figures and subfigures. This is to highlight 
properties within a given plot, but it can hide several things. In particular, note that the scale 
for the spectral norm is generally larger than for the Frobenius norm, which is generally larger 
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Figure 1. The spectral, Frobenius, and trace norm errors (top to bottom, respectively, 
in each subfigure) of several (non-rank-restricted in top panels and rank-restricted in 
bottom panels) SPSD sketches, as a function of the number of columns samples I, for 
the GR and HEP Laplacian data sets, with two choices of the rank parameter k. 
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Figure 2. The spectral, Frobenius, and trace norm errors (top to bottom, respectively, 
in each subfigure) of several (non-rank-restricted in top panels and rank-restricted in 
bottom panels) SPSD sketches, as a function of the number of columns samples I, for 
the Enron and Gnutella Laplacian data sets, with two choices of the rank parameter k. 



20 



ALEX GITTENS AND MICHAEL W. MAHONEY 



than for the trace norm, consistent with the size of those norms; and that the scale is larger for 
higher-rank approximations, e.g. compare GR k = 20 with GR k = 60, also consistent with the 
larger amount of mass captured by higher-rank approximations. 

• Both the non-rank-restricted and rank-restricted results are the same for £ — k. For £ > k, 
the non-rank-restricted errors tend to decrease (or at least not increase, as for GR and HEP 
the spectral norm error is flat as a function of £), which is intuitive. While the rank-restricted 
errors also tend to decrease for £ > k, the decrease is much less (since the rank-restricted 
plots are bounded below by unity) and the behavior is much more complicated as a function of 
increasing I. 

• The X axes ranges from k to 9k for the k = 20 plots and from k to 3k for the k = 60 plots. 
As a practical matter, choosing £ between k and (say) 2k or 3k is probably of greatest interest. 
In this regime, there is an interesting tradeoff for the non-rank-restricted plots: for moderately 
large values of £ in this regime, the error for leverage-based sampling is moderately better than 
for uniform sampling or random projections, while if one chooses I to be much larger then the 
improvements from leverage-based sampling saturate and the uniform sampling and random 
projection methods are better. This is most obvious in the Frobenius norm plots, although it 
is also seen in the trace norm plots, and it suggests that some combination of leverage-based 
sampling and uniform sampling might be best. 

• For the rank-restricted plots, in some cases, e.g., with GR and HEP, the errors for leverage-based 
sampling are much better than for the other methods and quickly improve with increasing £ 
until they saturate; while in other cases, e.g., with Enron and Gnutella, the errors for leverage- 
based sampling improve quickly and then degrade with increasing £. Upon examination, the 
former phenomenon is similar to what was observed in the non-rank-restricted case and is due 
to the strong "bias" provided by the leverage score importance sampling distribution to the 
top part of the spectrum, allowing the sampling process to focus very quickly on the low-rank 
part of the input matrix. (In some cases, this is due to the fact that the heterogeneity of the 
leverage score importance sampling distribution means that one is likely to choose the same 
high leverage columns multiple times, rather than increasing the accuracy of the sketch by 
adding new columns whose leverage scores are lower.) The latter phenomenon of degrading 
error quality as £ is increased is more complex and seems to be due to some sort of "overfitting" 
caused by this strong bias and by choosing many more than k columns. 

• The behavior of the approximations with respect to the spectral norm is quite different from the 
behavior in the Frobenius and trace norms. In the latter, as the number of samples £ increases, 
the errors tend to decrease, although in an erratic manner for some of the rank-restricted plots; 
while for the former, the errors tend to be much flatter as a function of increasing £ for at least 
the Gaussian, SRFT, and uniformly sampled sketches. 

All in all, there seems to be quite complicated behavior for low-rank sketches for these Laplacian data 
sets. Several of these observations can also be made for subsequent figures; but in some other cases the 
(very sparse and not very low rank) structural properties of the data are primarily responsible. 

3.4.2. Linear Kernels. Figure [3] shows the reconstruction error results for sampling and projection meth- 
ods applied to several Linear Kernels. The data sets (Dexter, Protein, SNPs, and Gisette) are all quite 
low-rank and have fairly uniform leverage scores. Several observations are worth making about the 
results presented in this figure. 

• All of the methods perform quite similarly for the non-rank-restricted case: all have errors that 
decrease smoothly with increasing £, and in this case there is little advantage to using methods 
other than uniform sampling (since they perform similarly and are more expensive). Also, since 
the ranks are so low and the leverage scores are so uniform, the leverage score sketch is no 
longer significantly distinguished by its tendency to saturate quickly. 
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Figure 3. The spectral, Frobenius, and trace norm errors (top to bottom, respectively, 
in each subfigure) of several (non-rank-restricted in top panels and rank-restricted in 
bottom panels) SPSD sketches, as a function of the number of columns samples I, for 
the Linear Kernel data sets. 
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• The scale of the Y axes is much larger than for the Laplacian data sets, mostly since the matrices 
are much more well-approximated by low-rank matrices, although the scale decreases as one 
goes from spectral to Frobenius to trace reconstruction error, as before. 

• For SNPs and Gisette, the rank-restricted reconstruction results are very similar for all four 
methods, with a smooth decrease in error as I is increased, although interestingly using lever- 
age scores is slightly worse for Gisette. For Dexter and Protein, the situation is more compli- 
cated: using the SRFT always leads to smooth decrease as I is increased, and uniform sampling 
generally behaves the same way also; Gaussian projections behave this way for Protein, but for 
Dexter Gaussian projections are noticably worse than SRFT and uniform sampling; and, except 
for very small values of I, leverage-based sampling is worse still and gets noticably worse as I is 
increased. Even this poor behavior of leverage score sampling on the Linear Kernels is notably 
worse than for the rank-restricted Laplacians, where there was a range of moderately small I 
where leverage score sampling was much superior to other methods. 

These linear kernels (and also to some extent the dense RBF kernels below that have larger a param- 
eter) are examples of relatively "nice" machine learning data sets that are similar to matrices where 
uniform sampling has been shown to perform well previously 11581 [37l l38l 13911 ; and for these matrices 
our empirical results agree with these prior works. 

3.4.3. Dense and Sparse RBF Kernels. Figure [4] and Figure [5] present the reconstruction error results 
for sampling and projection methods applied to several dense RBF and sparse RBF kernels. Several 
observations are worth making about the results presented in these figures. 

• For the non-rank-restricted results, all of the methods have errors that decrease with increasing 
I. In particular, for larger values of a and for denser data, the decrease is somewhat more 
regular, and the four methods tend to perform similarly. For larger values of a and sparser 
data, leverage score sampling is somewhat better. This parallels what we observed with the 
Linear Kernels, except that here the leverage score sampling is somewhat better for all values 
oft. 

• For the non-rank-restricted results for the smaller values of a, leverage score sampling tends to 
be much better than uniform sampling and projection-based methods. For the sparse data, how- 
ever, this effect saturates; and we again observe (especially when a is smaller in AbaloneS and 
WineS) the tradeoff we observed previously with the Laplacian data — leverage score sampling 
is better when I is moderately larger than k, while uniform sampling and random projections 
are better when I is much larger than k. 

• For the rank-restricted results, we see that when a is large, all of the results tend to perform 
similarly. (The exception to this is WineS, for which leverage score sampling starts out much 
better than other methods and then gets worse as I is increased.) On the other hand, when a 
is small, the results are more complex. Leverage score sampling is typically much better than 
other methods, although the results are quite choppy as a function of I, and in some cases the 
effect diminished as I is increased. 

Recall from Table [5] that for smaller values of cr and for sparser kernels, the SPSD matrices are less 
well-approximated by low-rank matrices, and they have more heterogeneous leverage scores. Thus, 
they are more similar to the Laplacian data than the Linear Kernel data; and this suggests (as we have 
observed) that leverage score sampling should perform relatively better than uniform column sampling 
and projection-based schemes when in these two cases. In particular, nowhere do we see that leverage 
score sampling performs much worse than other methods, as we saw with the rank-restricted Linear 
Kernel results. 

3.4.4. Summary of Comparison of Sampling and Projection Algorithms. Before proceeding, there are 
several summary observations that we can make about sampling versus projection methods for the 
data sets we have considered. 
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Figure 4. The spectral, Frobenius, and trace norm errors (top to bottom, respectively, 
in each subfigure) of several (non-rank-restricted in top panels and rank-restricted in 
bottom panels) SPSD sketches, as a function of the number of columns samples I, for 
several dense RBF data sets. 
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Figure 5. The spectral, Frobenius, and trace norm errors (top to bottom, respectively, 
in each subfigure) of several (non-rank-restricted in top panels and rank-restricted in 
bottom panels) SPSD sketches, as a function of the number of columns samples I, for 
several sparse RBF data sets. 
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• Linear Kernels and to a lesser extent Dense RBF Kernels with larger a parameter have relatively 
low-rank and relatively uniform leverage scores, and in these cases uniform sampling does quite 
well. These data sets correspond most closely with those that have been studied previously in 
the machine learning literature, and for these data sets our results are in agreement with that 
prior work. 

• Sparsifying RBF Kernels and/or choosing a smaller a parameter tends to make these kernels 
less well-approximated by low-rank matrices and to have more heterogeneous leverage scores. 
In general, these two properties need not be directly related — the spectrum is a property of 
eigenvalues, while the leverage scores are determined by the eigenvectors — but for the data 
we examined they are related, in that matrices with more slowly decaying spectra also often 
have more heterogeneous leverage scores. 

• For Dense RBF Kernels with smaller a and Sparse RBF Kernels, leverage score sampling tends 
to do much better than other methods. Interestingly, the Sparse RBF Kernels have many 
properties of very sparse Laplacian Kernels corresponding to relatively-unstructured informat- 
ics graphs, an observation which should be of interest for researchers who construct sparse 
graphs from data using, e.g., "locally linear" methods, to try to reconstruct hypothesized low- 
dimensional manifolds. 

• Reconstruction quality under leverage score sampling saturates, as a function of choosing more 
samples t; this is seen both for non-rank-restricted and rank-restricted situations. As a conse- 
quence, there can be a tradeoff between leverage score sampling or other methods being better, 
depending on the values of I that are chosen. 

• Although they are potentially ill-conditioned, non-rank-restricted approximations behave bet- 
ter in terms of reconstruction quality. Rank-constrained approximations tend to have much 
more complicated behavior as a function of increasing the numbe of samples I, including chop- 
pier and non-monotonic behavior. This is particularly severe for leverage score sampling, but 
it occurs with other methods; and it suggests that other forms of regularization (other than 
what is essentially a Tikhonov form of regularization for the rank-restricted cases) might be 
appropriate. 

In general, all of the sampling and projection methods we considered perform much better on the 
SPSD matrices we considered than previous worst-case bounds (e.g., 11211 |39l 12810 would suggest. 
(That is, even the worst results correspond to single-digit approximation factors in relative scale.) This 
observation is intriguing, because the motivation of leverage score sampling (and, recall, that in this 
context random projections should be viewed as performing uniform random sampling in a randomly- 
rotated basis where the leverage scores have been approximately uniformized 1 14611 ) is very much tied 
to the Frobenius norm, and so there is no a priori reason to expect its good performance to extend to 
the spectral or trace norms. Motivated by this, we revisit the question of proving improved worst-case 
theoretical bounds in Section |4l 



Before describing these improved theoretical results, however, we address in Section 3.5 running 
time questions. After all, a naive implementation of sampling with exact leverage scores is slower than 
other methods (and much slower than uniform sampling). As shown below, by using the recently- 
developed approximation algorithm of II20II , not only does this approximation algorithm run in time 
comparable with random projections (for certain parameter settings), but it leads to approximations 
that soften the strong bias that the exact leverage scores provide toward the best rank-fc approximation 
to the matrix, thereby leading to improved reconstruction results in many cases. 

3.5. Reconstruction Accuracy of Leverage Score Approximation Algorithms. A naive view might 
assume that computing probabilities that permit leverage-based sampling requires an 0(n 3 ) computa- 
tion of the full SVD, or at least the full computation of a partial SVD, and thus that it would be much 
more expensive than recently-developed random projection methods. Indeed, an "exact" computation 
of the leverage scores with a QR decomposition or truncated SVD takes roughly 0(n 2 fc) time (and the 
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Input: A G R nxd (with SVD A = U£V r ), error parameter e e (0, 1/2]. 

Output: ti, i = 1, . . ., n, approximations to the leverage scores of A. 

(1) Let IIj e M r i xn be an SRFT with 

r x = Q(e" 2 (V / d+ \fkmflnd) 

(2) Compute IIjA e M riXd and its QR factorization n x A = QR. 

(3) Let n 2 e M dxr2 be a matrix of i.i.d. standard Gaussian random variables, where 

r 2 = ft(e" 2 lnn) . 

(4) Construct the product SI = AR _1 II 2 . 

II l|2 

(5) For i — 1, . . . ,n compute l i = 2 - 

Algorithm 1: Algorithm (originally Algorithm 1 in 112010 for approximating the leverage scores l i of 
an n x d matrix A, where n » d, to within a multiplicative factor of lie. The running time of the 
algorithm is 0(ndln(v / d + vlnri) + nde~ 2 \nn + d 2 e" 2 (v / d + v1rTn) 2 lnd). 



Input: A G R nxd , a rank parameter k, and an error parameter e e (0, 1/2]. 

Output: i = 1, . . . , n, approximations to the leverage scores of A filtered through its 
dominant dimension-?: subspace. 

(1) Construct II e M. dx2k with i.i.d. standard Gaussian entries. 

(2) Compute B = (AA r ) 9 An e R nx2k with 

In (l + + e \f\ V min {n,d}~ k) 
q ~ 21n(l + e/10)-l/2 ' 

(3) Approximate the leverage scores of B by calling Algorithm [l] with inputs B and e; let 
l i for i — 1, . . . , n be the outputs of Algorithm ol 



Algorithm 2: Algorithm (originally Algorithm 4 in 112010 for approximating the leverage scores (rel- 
ative to the best rank-fc approximation to A) of a general n x d matrix A with those of a matrix 
that is close by in the spectral norm (or the Frobenius norm if q = 0). This algorithm runs in time 
O(ndkq) + T 1 , where T 1 is the running time of Algorithm[l| 



running time results of Section 3.4 actually used this naive procedure). Recent work, however, has 
shown that relative-error approximations to all the statistical leverage scores can be computed more 
quickly than this exact algorithm [20]. Here, we implement and evaluate a version of this algorithm, 
and we evaluate it both in terms of running time and in terms of reconstruction quality on the di- 
verse suite of real data matrices we considered above. We note that ours is the first work to provide 
an empirical evaluation of an implementation of the leverage score approximation algorithms of [20], 
illustrating empirically the tradeoffs between cost and efficiency in a practical setting. 

3.5.1. Description of the Fast Approximation Algorithm of II20II . Algorithm [l] (which originally appeared 
as Algorithm 1 in [20]) takes as input an arbitrary nxd matrix A, where n » d, and it returns as output 
a 1 ± e approximation to all of the statistical leverage scores of the input matrix. The original algorithm 
of 1 12 Oil uses a subsampled Hadamard transform and requires r x to be somewhat larger than what we 
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Input: A e R nxd , a rank parameter k, and an iteration parameter q. 

Output: i e= 1, . . . , n, approximations to the leverage scores of A filtered through its 
dominant dimension-fc subspace. 

(1) Construct an SRHT matrix II e R dxr , where 

r> 36e" 2 [/fc+ ^81n(fcd)] 2 ln(fc) 

(2) Compute B = (AA T ) 9 An e M nxr , where q > is an integer. 

(3) Return the exact leverage scores of B. 



Algorithm 3: Algorithm for approximating the leverage scores (relative to the best rank-fc approxima- 
tion to A) of a general n x d matrix A with those of a matrix that is close by in the spectral norm. This 
is a modified version of Algorithm [2] in which the random projection is implemented with an SRFT 
rather than a Gaussian random matrix, and where the number of "iterations" q is prespecified. This 
algorithm runs in time 0(nd In r + ndrq + nr 2 ) since An can be computed in time 0(nd In r). 



state in Algorithm [T| That an SRFT with a smaller value of r 1 can be used instead is a consequence of 
the fact that Lemma 3 in [20] is also satisfied by an SRFT matrix with the given r x ; this is established 

in [lam. 

The running time of this algorithm, given in the caption of the algorithm, is roughly 0(nd In d) 
when d = Q(lnn). Thus Algorithm [l] generates relative-error approximations to the leverage scores 
of a tall and skinny matrix A in time o(nd 2 ), rather than the 0(nd 2 ) time that would be required to 
compute a QR decomposition or a thin SVD of the nxd matrix A. The basic idea behind Algorithm [I] 
is as follows. If we had a QR decomposition of A, then we could postmultiply A by the inverse of 
the "R" matrix to obtain an orthogonal matrix spanning the column space of A; and from this nxd 
orthogonal matrix, we could read off the leverage scores from the Euclidean norms of the rows. Of 
course, computing the QR decomposition would require 0(nd 2 ) time. To get around this, Algorithm [l] 
premultiplies A by a structured random projection n 1; computes a QR decomposition of n x A, and 
postmultiplies A by R -1 , i.e., the inverse of the "R" matrix from the QR decomposition of njA. Since 
n x is an SRFT, premultiplying by it takes roughly 0(nd In d) time. In addition, note that n x A needs 
to be post multiplied by a second random projection in order to compute all of the leverage scores in 
the allotted time; see [20] for details. This algorithm is simpler than the algorithm in which we are 
primarily interested that is applicable to square SPSD matrices, but we start with it since it illustrates 
the basic ideas of how our main algorithm works and since our main algorithm calls it as a subroutine. 
We note, however, that this algorithm is directly useful for approximating the leverage scores of Linear 
Kernel matrices A = XX r , when X is a tall and skinny matrix. 

Consider, next, Algorithm [2] (which originally appeared as Algorithm 4 in [20]), which takes as input 
an arbitrary nxd matrix A and a rank parameter k, and returns as output a 1 ± e approximation to all 
of the statistical leverage scores (relative to the best rank-fc approximation) of the input. An important 
technical point is that the problem of computing the leverage scores of a matrix relative to a low- 
dimensional space is ill-posed, essentially because the spectral gap between the fc and the (fc + l) st 
eigenvalues can be small, and thus Algorithm[2]actually computes approximations to the leverage scores 
of a matrix that is near to A in the spectral norm (or the Frobenius norm if q = 0). See II20II for details. 
Basically, this algorithm uses Gaussian sampling to find a matrix close to A in the Frobenius norm or 
spectral norm, and then it approximates the leverage scores of this matrix by using Algorithm [l] on the 
smaller, very rectangular matrix B. When A is square, as in our applications, Algorithm [2] is typically 
more costly than direct computation of the leverage scores, at least for dense matrices (but it does have 



28 



ALEX GITTENS AND MICHAEL W. MAHONEY 



the advantage that the number of iterations is bounded, independent of properties of the matrix, which 
is not true for typical iterative methods to compute low-rank approximations). 

Of greater practical interest is Algorithm [3] which is a modification of Algorithm [2] in which the 
Gaussian random projection is replaced with an SRFT. That is, Algorithm [3] uses an SRFT projection 
to find a matrix close by to A in the Frobenius norm or spectral norm (depending on the value of q), 
and then it exactly computes the leverage scores of this matrix. This improves the running time to 
0(n 2 ln(VT+ Vhm) + n 2 (v / fc+ v1nn) 2 ln(fc)q + n(v / fc+ vlnn) 4 ln 2 (fc)), which is o(n 2 fc) when q = 0. 
Thus an important point for Algorithm[3](as well as for Algorithm^ is the parameter q which describes 
the number of iterations. For q = iterations, we get an inexpensive Frobenius norm approximation; 
while for higher q, we get better spectral norm approximations that are more expensivd^This flexibility 
is of interest, as one may want to approximate the actual leverage scores accurately or one may simply 
want to find crude approximations useful for obtaining SPSD sketches with low reconstruction error. 

Finally, note that although choosing the number of iterations q as we did in Algorithm[2]is convenient 
for worst-case analysis, as a practical implementational matter it is easier either to choose q based on 
spectral gap information revealed during the running of the algorithm or to prespecify q to be a small 
integer, e.g., 2 or 3, before the algorithm runs. Both of these have an interpretation of accelerating the 
rate of decay of the spectrum with a power iteration, but they behave somewhat differently due to the 
different stopping conditions. Below, we consider both variants. 

3.5.2. Running Time Comparisons. Here, we describe the performances of the various random sam- 



pling and random projection low-rank sketches considered in Section 3.4 in terms of their running 
time, where the method that involves using the leverage scores to construct the importance sampling 
distribution is implemented both by computing the leverage scores "exactly" by calling a truncated 
SVD, as a black box, as well as computing them approximately by using one of several versions of 
Algorithm [3] Our running time results are presented in Figure [6] and Figure [7J 

We start with the results described in Figure [6] which shows the running times, as a function of 
I, for the low-rank approximations described in Section [3~4^ i.e., for column sampling uniformly at 
random without replacement; for column sampling according to the exact nonuniform leverage score 
probabilities; and for sketching using Gaussian and SRFT mixtures of the columns. Several observations 
are worth making about the results presented in this figure. 

• Uniform sampling is always less expensive and typically much less expensive than the other 
methods, while (with one minor exception) sampling according to the exact leverage scores is 
always the most expensive method. 

• For most matrices, using the SRFT is nearly as expensive as exact leverage score sampling. This 
is most true for the very sparse graph Laplacian Kernels, largely since the SRFT does not respect 
sparsity. The main exception to this is for the dense and relatively well-behaved Linear Kernels, 
where especially for large values of I the SRFT is quite fast and usually not too much more 
expensive than uniform sampling. 

• The "fast Fourier" methods underlying the SRFT can take advantage of the structure of the 
Linear Kernels to yield algorithms that are similar to Gaussian projections and much better 
than exact leverage score computation. Note that the reason that SRFT is worse than Gaussians 
here is that the matrices we are considering are not extremely large, and we are not considering 
very large values of the rank parameter. Extending in both those directions leads to Gaussian 
projections being slower than SRFT, as the trends in the figures clearly indicate. 

• Gaussian projections are not too much slower than uniform sampling for the extremely sparse 
Laplacian Kernels — this is due to the sparsity of the Laplacian Kernels, since Gaussian pro- 
jections can take advantage of the fast matrix-vector multiply, while the SRFT-based scheme 



Observe that since A is rectangular in Algorithms ^and^] we approximate the leverage scores of A with those of 
B = (AA r ) <! An; in particular the case q = corresponds to taking B = An. By way of contrast, when we use the power 
method to construct sketches of an SPSD matrix, we take C = A q S, so the case q = 1 corresponds to C = AS. 
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Figure 6. The times required to compute the (non-rank-restricted) SPSD sketches, as a 
function of the number of columns samples I for several data sets and two choices of 
the rank parameter k. 
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Figure 7. The times required to compute the (non-rank-restricted) approximate lever- 
age score-based SPSD sketches, as a function of the number of columns samples I for 
several data sets. 
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Figure 8. The running time of (non-rank-restricted) SPSD sketches computed using 
Algorithm [I] compared with that of other approximate leverage score-based SPSD 
sketches, as a function of the number of column samples £ for two Linear Kernel 
datasets. The parameters in Algorithm [l] were taken to be r 1 = e~ 2 ln(d5~ 1 )(v / d + 
vAnCnS- 1 )) 2 and r 2 = e" 2 (lnn + lnS" 1 ) with e = 1 and 5 = 1/10. 



A-CWtC r || 2 /||A-Ai,|| 2 




||A-CW;.C|| 2 /||A-A t | 



||A-CW*C|| 2 /||A-A t ||, 



20 40 60 SO 

£ (column samples) 
IIA-CWtCl^/IIA-A.H 




20 40 60 80 

£ (column samples) 
IIA-CWtcni./IIA-AJ. 



■ 


QR lev 
















A spec le' 




— 1— srfl 




-*-Alg 1 





I A A A ^ * 3 









20 40 60 80 
£ (column samples* 
||A-CWi ; C r ||,-/l|A- 


100 

A,|| f - 








20 40 60 80 
£ (column samples* 
||A-CW;C T ||„/||A- 


100 

A*|U 




QR lev 
A spec le' i 




— ^srft 
-*-Algl 





20 40 60 SO 

£ (column samples) 

(a) Protein, k = 10, 
non-rank-restricted 



'20 40 60 80 

£ (column samples) 

(b) Protein, fc = 10, 
rank-restricted 




£ (column samples) 
A-CW;c r ||,/|!A-A,.||, 





-#- QR lev 
















A spec le' 




— 1— srfl 




-*-Alg 1 







-#- QR lev 
















A spec le' 




— 1— srfl 
-tt-Algl 





£ (column samples 

(c) SNPs, fc = 5, 
non-rank-restricted 



£ (column samples 



(d) SNPs, fc = 5, 
rank-restricted 



Figure 9. The spectral, Frobenius, and trace norm errors (top to bottom, respectively, in 
each subfigure) of SPSD sketches computed using Algorithm [l] compared with those of 
other approximate leverage score-based sketching schemes, as a function of the number 
of columns samples I, for two Linear Kernel data sets. The parameters in Algorithm [I] 

were taken to be r a = e" 2 ln(d5" 1 )(v / d + 0n(n<5 -1 )) 2 and r 2 = e" 2 (lnn+ln5 _1 ) with 
e = l and 5 = 1/10. 
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cannot — but this advantage is lost for the (denser) Sparse RBF Kernels, to the extent that there 
is little running time improvement relative to the Dense RBF Kernels. In addition, Gaussian pro- 
jections are relatively slower, when compared to the SRFT and uniform sampling, for the Dense 
RBF Kernels than for the Linear Kernels, although both of those data sets are maximally dense. 

We next turn to the results described in Figure [7J which shows the running times, as a function of 
I, for several variants of approximate leverage-based sampling. For ease of comparison, the timings 
for uniform sampling ("unif") and exact leverage score sampling ("levscore") are depicted in Figure [TJ 
using the same shading as used in Figure |6j In addition to these two baselines, Figure [7] shows running 
time results for the following three variants of approximate leverage score sampling: "frob levscore" 
(which is Algorithm [3] with q = and r = 2k); "spec levscore" (Algorithm [3] with q = 4 and r = 2k); 
and "power". The "power" scheme is a version of Algorithm [3] where r = k and q is determined by 
monitoring the convergence of the leverage scores of A 2q+1 II and terminating when the change in 
the leverage scores between iterations, as measured in the infinity norm, is smaller than 10~ 2 . This is 
simply a version of subspace iteration with a convergence criterion appropriate for the task at hand. 
Since "frob levscore" requires one application of an SRFT, its timing results are depicted using the 
same shade as the SRFT timing results in Figure [6j (There are no other correspondences between the 
shadings in the two figures.) Several observations are worth making about the results presented in 
this figure. 

• These approximate leverage score-based algorithms can be orders of magnitude faster than 
exact leverage score computation; but, especially for "spec levscore" when q is not prespecified 
to be 2 or 3, they can even be somewhat slower. Exactly which is the case depends upon the 
properties of the matrix and the parameters used in the approximation algorithm, including 
especially the number of power iterations. 

• The "frob levscore" approximation method has running time comparable to the running time 
of the SRFT, which is expected, given that the computation of the SRFT is the theoretical 
bottleneck for the running time of the "frob levscore" algorithm. In particular, for larger values 
of I for Linear Kernels, "frob levscore" is not much slower than uniform sampling. 

• The "spec levscore" and "power" approximations with q > are more expensive than the q = 
"frob lev" approximation, which is a result of the relatively-expensive matrix-matrix multiplica- 
tion. For the Linear Kernels, both are much better than the exact leverage score computation, 
and for most other data at least "power" is somewhat less expensive than the exact leverage 
score computation. For example, this is particularly true for the Laplacian Kernels. 

Recall that the cost associated with these SPSD sketches is two-fold: first, the cost to construct the 
sample — by sampling columns uniformly at random, by computing a nonuniform importance sampling 
distribution, or by performing a random projection to uniformize the leverage scores; and second, 
the cost to construct the low-rank approximation from the sample. For uniform sampling, the latter 
step dominates the cost, while for more sophisticated methods the former step typically dominates the 
cost. The approximate leverage score sampling methods are still sufficiently expensive that the cost of 
computing the sampling probabilities still dominates the cost to construct the low-rank approximation. 

Finally, Algorithm [I] can be used to approximate quickly the leverage scores of matrices of the form 
A = XX T , when X e M" xd is a rectangular matrix of sufficent aspect ratio, and in such cases it is faster 
than Algorithm [3j Specifically, for the first dimensional reduction step in Algorithm [I] to be beneficial 
(i.e., to ensure r 1 < n), the condition n = f2(dlnd) is necessary; for the second dimensional reduction 
step to be beneficial (i.e., to ensure r 2 < d), the condition d = f2(lnn) must be satisfied. Figure [8] 
summarizes our main results for the run time of Algorithm [I] applied to rectangular matrices with 
n » d. Among other things, Figure [8] illustrates, using the Linear Kernel datasets Protein and SNPs 
(which satisfy these constraints), two points. 
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• Most importantly, the running time of Algorithm [T] on these rectangular matrices is faster than 
performing a QR decomposition on A and is comparable to applying a SRFT to A. This is 
expected, since the running time bottleneck for Algorithm [I] is the application of the SRFT. 

• In addition, the running time of Algorithm [I] is significantly faster than the other approximate 
leverage score algorithms. This too is expected, since these other algorithms are applied to A 
and ignore the rectangular structure of X. 

Figure [9] shows that these improved running time gains for Algorithm [I] can come at the cost of a slight 
loss in the reconstruction accuracy (relative to the exact computation of the leverage scores) of the 
low-rank approximations; the accuracy of the other approximate leverage score algorithms is discussed 
in the following subsection. 



3.5.3. Reconstruction Accuracy Results. Here, we describe the performances of the various low-rank 
approximations that use approximate leverage scores in terms of reconstruction accuracy for the data 
sets described in Section [3TTj The results are presented in Figure [To] through Figure 14 The setup for 
these results parallels that for the low-rank approximation results described in Section 3.4 and these 
figures parallel Figure [I] through Figure |5} To provide a baseline for the comparison, we also plot the 
previous reconstruction errors for sampling with the exact leverage scores as well as the uniform column 
sampling sketch. Several observations are worth making about the results presented in these figures. 

• For Laplacian Kernels, for the non-rank-restricted results, "frob levscore" is only slightly better 
than uniform sampling, while "power" and "spec levscore" are substantially better than uniform 
sampling. All of those methods also lead to better reconstruction results even than using the 
exact leverage scores (suggesting that some form of implicit regularization is taking place) : the 
reconstruction quality is higher for a given £ and, also, using approximate leverage scores does 
not lead to the saturation effect observed when using the exact leverage scores. 

• For Laplacian Kernels, for the rank-restricted results, the "frob levscore" results are similar to 
the exact leverage score results for I = k, but the quality degrades considerably as £ increases. 
On the other hand, "power" and "spec levscore" are much better than using the exact leverage 
scores when i — k, and are slightly better or only slightly worse as £ increases. 

• For the Linear Kernels, all the methods perform similarly in the non-rank-restricted case; while 
in the rank-restricted case, the methods that use approximate leverage scores tend to parallel 
the exact leverage score results, both when those get better and when those get worse with 
increasing £. 

• For both the dense and the sparse RBF data sets, for the non-rank-restricted case, the approx- 
imate leverage score algorithms tend to parallel the exact leverage score algorithm, and they 
are not substantially better. In particular, both "power" and "spec levscore" tend to saturate 
when the exact method saturates, but in those cases "frob levscore" tends not to saturate. 

• For both the dense and sparse RBF data sets, for the rank-restricted case, the results depend on 
the value of the cr width parameter. When a is larger and the matrices are more homogeneous, 
all the methods tend to parallel each other (although WineS is an exception). When cr is 
smaller, "frob levscore" is generally better than uniform sampling but worse than the other 
methods for £ = k, but it degrades with increasing £; while both "power" and "spec levscore" 
tend to parallel the results for the exact leverage scores. 

Note that the difference between different approximate leverage score algorithms often corresponds to 
a difference in the spectral gaps of the corresponding matrices. From Table [5] if we fix k and use the 
approximate leverage scores filtered through rank k to form a Nystrom approximation to A, the accuracy 
of that approximation has a strong dependence on the spectral gap of A at rank k, as measured by — — . 
In general, the larger the spectral gap, the more accurate the approximation. This phenomena can 
also be understood in terms of the convergence of the approximate leverage scores: the approximation 
algorithms (in Algorithm[2]and Algorithm[3]) are essentially truncated versions of the subspace iteration 
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Figure 10. The spectral, Frobenius, and trace norm errors (top to bottom, respectively, 
in each subfigure) of several (non-rank-restricted in top panels and rank-restricted in 
bottom panels) approximate leverage score-based SPSD sketches, as a function of the 
number of columns samples I, for the GR and HEP Laplacian data sets, with two choices 
of the rank parameter k. 
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Figure 11. The spectral, Frobenius, and trace norm errors (top to bottom, respectively, 
in each subfigure) of several (non-rank-restricted in top panels and rank-restricted in 
bottom panels) approximate leverage score-based SPSD sketches, as a function of the 
number of columns samples I, for the Enron and Gnutella Laplacian data sets, with two 
choices of the rank parameter k. 
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Figure 12. The spectral, Frobenius, and trace norm errors (top to bottom, respectively, 
in each subfigure) of several (non-rank-restricted in top panels and rank-restricted in 
bottom panels) approximate leverage score-based SPSD sketches, as a function of the 
number of columns samples I, for the Linear Kernel data sets 
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Figure 13. The spectral, Frobenius, and trace norm errors (top to bottom, respectively, 
in each subfigure) of several (non-rank-restricted in top panels and rank-restricted in 
bottom panels) approximate leverage score-based SPSD sketches, as a function of the 
number of columns samples I, for several dense RBF data sets. 
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Figure 14. The spectral, Frobenius, and trace norm errors (top to bottom, respectively, 
in each subfigure) of several (non-rank-restricted in top panels and rank-restricted in 
bottom panels) approximate leverage score-based SPSD sketches, as a function of the 
number of columns samples I, for several sparse RBF data sets. 
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method for computing the top k eigenvectors of A. It is a classical result that the spectral gap determines 
the rate of convergence of the subspace iteration process to the desired eigenvectors: the larger it is, the 
fewer iterations of the process are required to get accurate approximations of the top eigenvectors. It 
follows immediately that the larger the spectral gap, the more accurate the approximate leverage scores 
generated by these approximation algorithms are. Our empirical results illustrate the complexities and 
subtle consequences of these properties in realistic machine learning applications of even modestly- 
large size. 

3.5.4. Summary of Leverage Score Approximation Algorithms. Before proceeding, there are several sum- 
mary observations that we can make about the running time and reconstruction quality of approximate 
leverage score sampling algorithms for the data sets we have considered. 

• The running time of computing the exact leverage scores is generally much worse than that of 
uniform sampling and both SRFT-based and Gaussian-based random projection methods. 

• The running time of computing approximations to the leverage scores can, with appropriate 
choice of parameters, be much faster than the exact computation of the leverage scores; and, 
especially for "frob levscore," can be comparable to the running time of the random projection 
(SRFT or Gaussian) used in the leverage score approximation algorithm. For the methods that 
involve q > iterations to compute stronger approximations to the leverage scores, the running 
time can vary considerably depending on details of the stopping condition. 

• The leverage scores computed by the "frob levscore" procedure are typically very different 
than the "exact" leverage scores, but they are leverage scores for a low-rank space that is 
near the best rank-fc approximation to the matrix. This is often sufficient for good low-rank 
approximation, although the reconstruction accuracy can degrade in the rank-restricted cases 
as I is increased. 

• The approximate leverage scores computed from "power" and "spec levscore" approach those 
of the exact leverage scores, as q is increased; and they obtain reconstruction accuracy that is 
no worse, and in many cases is better, than that obtained by the exact leverage scores. This 
suggests that, by not fitting exactly to the empirical statistical leverage scores, we are observing 
a form of implicit regularization. 

• The running time of Algorithm [TJ when applied to "tall" matrices for which n » d, is faster 
than the running time of performing a QR decomposition of the matrix A; and it is compa- 
rable to the running time of applying a random projection to A (which is the computational 
bottleneck of applying Algorithm [TJ. Thus, in particular, one could use this algorithm to com- 
pute approximations to the leverage scores to obtain a sketch that provides a relative-error 
approximation to a least-squares problem involving A 11221 1231 14611 ; or one could use the sketch 
thereby obtained as a preconditioner to an iterative method to solve the least-squares problem, 
in a manner analogous to how Blendenpik or LSRN do so with a random projection fl3] 15011 . 

Previous work has showed that one can implement random projection algorithms to provide low-rank 
approximations with error comparable to that of the SVD in less time than state-of-the art Krylov solvers 
and other "exact" numerical methods 11321 146|| . Our empirical results show that these random projec- 
tion algorithms can be used in two complementary ways to approximate SPSD matrices of interest in 
machine learning: first, they can be used directly to compute a projection-based low-rank approxima- 
tion; and second, they can be used to compute approximations to the leverage scores, which can be 
used to compute a sampling-based low-rank approximation. With the right choice of parameters, the 
two complementary approaches have roughly comparable running times, and neither one dominates 
the other in terms of reconstruction accuracy. 

3.6. Projection-based Sketches. Finally, for completeness, we consider the performance of the two 
projection-based SPSD sketches proposed in II32II . and we show how they perform when compared 
with the sketches we have considered. Recall that the idea of these sketches is to construct low-rank 



40 



ALEX GITTENS AND MICHAEL W. MAHONEY 



HA-CWtC^/HA-Ail 



||A-CWtC|M|A-A»||, 



A - CW'C'lli/IIA - Akh 



||A-CWtC T || 2 /||A-A t | 




40 60 80 100 !2Q 140 160 180 ' 

f (column samples) 
||A-CWtC r ||./||A-A,||, 




40 60 80 1 00 ' 20 40 60 80 100 120 140 160 ' 20 40 60 80 100 120 140 160 

£ (column samples) £ (column samples) £ (column samples) 

||A - CWtC r ||,/||A - A,||, ||A - CWtC J ||./||A - A,||, ||A - CW t C r || > /|| A - A,||. 



40 60 80 100 120 140 160 180 
I (column samples) 




(a) Gnutella, k = 20 



100 ' 20 40 60 80 100 120 140 160 ' 20 40 60 80 100 120 140 160 
£ (column samples) £ (column samples) I (column samples) 

(b) Dexter, fc = 8 (c) AbaloneD, a = .15, fc = 20 (d) WineS, a = 1, k = 20 



Figure 15. The spectral, Frobenius, and trace norm errors (top to bottom, respectively, 
in each subfigure) of several non-rank-restricted SPSD sketches, including the pinched 
and prolonged sketches, as a function of the number of columns samples I, for several 
datasets. Pinched and prolonged sketches, respectively indicated by "pn." and "pr.", are 
defined in Equations ((£]) and ( flO"] ). 



approximations by forming an approximate basis Q for the top eigenspace of A and then restricting A 
to that eigenspace. In more detail, given a sketching matrix S, form the matrix Y = AS and take the 
QR decomposition of Y to obtain Q, a matrix with orthonormal columns. The first sketch, which we 
eponymously refer to as the pinched sketch, is simply A pinched to the space spanned by Q : 

(9) Q(Q r AQ)Q r . 

The second sketch, which we refer to as the prolonged sketch, is 

(10) AQ(Q T AQ) t Q r A. 

It is clear that the prolonged sketch can be constructed using our SPSD Sketching Model by taking Q 
as the sketching matrix. In fact, a stronger statement can be made. As shown in [28], and as stated in 
Lemma [l] below, it is the case, for any sketching matrix X, that when C = AX and W = X r AX, 

CW f C T =A 1/2 P A i/2 X A 1/2 . 

By considering the two choices X = AS and X = Q, we see that in fact the prolonged sketch is exactly 
the sketch obtained by applying the power method with q = 2 : 

AQ(Q r AQ)Q r A = A 1/2 P a1/2q A 1/2 

- A^ 2 P ,„ A 1 / 2 
A 1 / 2 (AS) 

= A 2 S(S T A 3 S) t S 7 A 2 . 

It follows that the bounds we provide in Section [4] on the performance of sketches obtained using the 
power method pertain also to prolonged sketches. 
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In Figure [15] we compare the empirical performances of several of the SPSD sketches considered 
earlier with their pinched and prolonged variants. Specifically, we plot the errors of pinched and pro- 
longed sketches for several choices of sketching matrices — corresponding to uniform column sampling, 
gaussian column mixtures, and SRFT-based column mixtures — along with the errors of non-pinched, 
non-prolonged sketches constructing using the same choices of S. In the interest of brevity, we provide 
results only for several of the datasets listed in Table [4j and we consider only the nonfixed-rank variants 
of the sketches. 



Some trends are clear from Figure 15 



• In the spectral norm, the prolonged sketches are considerably more accurate than the pinched 
and standard sketches for all the datasets considered. Without exception, the prolonged Gauss- 
ian and SRFT column-mixture sketches are the most accurate in the spectral norm, of all the 
sketches considered. Only in the case of the Dexter Linear Kernel is the prolonged uniformly 
column-sampled sketch nearly as accurate in the spectral norm as the prolonged Gaussian and 
SRFT sketches. To a lesser extent, the prolonged sketches are also more accurate in the Frobe- 
nius and trace norms than the other sketches considered. The increased Frobenius and trace 
norm accuracy is particularly notable for the two RBF Kernel datasets; again, the prolonged 
Gaussian and SRFT sketches are considerably more accurate than the prolonged uniformly 
column-sampled sketches. 

• After the prolonged sketches, the pinched Gaussian and SRFT column-mixture sketches ex- 
hibit the least spectral, Frobenius, and trace norm errors. Again, however, we see that the 
pinched uniformly column-sampled sketches are considerably less accurate than the pinched 
Gaussian and SRFT column-mixture sketches. Particularly in the spectral and Frobenius norms, 
the pinched uniformly column-sampled sketches are not any more accurate than the basic uni- 
formly column-sampled sketches. 

From these considerations, it seems evident that the benefits of pinched and prolonged sketches are 
most prominent when the spectral norm is the error metric, or when the dataset is an RBF Kernel. 
In particular, pinched and prolonged sketches are not significantly more accurate (than the sketches 
considered in the previous subsections) in the Frobenius and trace norms for any of the datasets con- 
sidered. 



It is also evident from Figure 15 that the pinched sketches often have a much slighter increase in 
accuracy over the basic sketches than do the prolonged sketches. To understand why the pinched 
sketches are less accurate than the prolonged sketches, observe that the pinched sketches satisfy 

Q(Q r AQ)Q r =P AS AP AS 

= (P AS A 1 / 2 )(A 1 / 2 P AS ), 

while, as noted above, the prolonged sketches can be written in the form 

AQ(Q T AQ) t Q T A = (A 1 / 2 P A 3 /2s )(P A 3/2 S A 1 / 2 ). 

Thus, pinched and prolonged sketches approximate the square root of A by projecting, respectively, 
onto the ranges of AS and A 3//2 S. The spectral decay present in A is increased when A is raised to a 
power larger than one; consequently, the range of A 3 ^ 2 S is more biased towards the top fc-dimensional 
invariant subspace of A than is the range of AS. It follows that the approximate square root used to 
construct the prolonged sketches more accurately captures the top fc-dimensional subspace of A than 
does that used to construct the pinched sketches. 

4. Theoretical Aspects of SPSD Low-rank Approximation 

In this section, we present our main theoretical results, which consist of a suite of bounds on the 
quality of low-rank approximation under several different sketching methods. As mentioned above, 
these were motivated by our empirical observation that all of the sampling and projection methods we 
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considered perform much better on the SPSD matrices we considered than previous worst-case bounds 



(e.g., 11211 [391 [2811 ) would suggest. We start in Section 4.1 with deterministic structural conditions for 



the spectral, Frobenius, and trace norms; and then in Section 4.2 we use these results to provide our 
bounds for several random sampling and random projection procedures. 

4.1. Deterministic Error Bounds for Low-rank SPSD Approximation. In this section, we present 
three theorems that provide error bounds for the spectral, Frobenius, and trace norm approximation 
errors under the SPSD Sketching Model of Section |272| These are provided in Sections 4.1.1 4.1.3 



and 4.1.5[ respectively, and they are followed by several more general remarks in Section 4.1.6 Note 
that these bounds hold for any, e.g., deterministic or randomized, sketching matrix S. Thus, e.g., one 
could use them to check, in an a posteriori manner, the quality of a sketching method for which one 



cannot establish an a priori bound. Rather than doing this, we use these results (in Section 4.2 below) 



to derive a priori bounds for when the sketching operation consists of common random sampling and 
random projection algorithms. 

Our results are based on the fact, established in II28II . that approximations which satisfy our SPSD 
Sketching Model can be written in terms of a projection onto a subspace of the range of the square root 
of the matrix being approximated. The following fact appears in the proof of Proposition 1 in [28]. 



Lemma 1 ([28]). Let A be an SPSD matrix and S be a conformal sketching matrix. Then when C 
and W = S r AS, the corresponding low-rank SPSD approximation satisfies 

CW f C T =A 1/2 P A i/2 S A 1/2 . 



AS 



4.1.1. Spectral Norm Bounds. We start with a bound on the spectral norm of the residual error. Al- 
though this result is trivial to prove, given prior work, it highlights several properties that we use in the 
analysis of our subsequent results. 

Theorem 1. Let Abe an n x n SPSD matrix with eigenvalue decomposition partitioned as in Equation ([!]), 
S be a sketching matrix of size n x I, qbe a positive integer, and 1^ and f2 2 be as defined in Equation ([3]). 
Then when C = A q S and W = S r A 2g_1 S, the corresponding low-rank SPSD approximation satisfies 



Ia-cW'c 7 '^ < ||s 2 || 2 + 



,q-l/2 



2/(2q-l) 

s 

2 



assuming ft 1 has full row rank. 

Proof. Apply Lemma[l]with the sampling matrix S' = A 9_1 S (where, recall, q > 1) to see that 



CW J( C T =A 1/2 P A ,-i/2 S A 1/2 . 



It follows that 
(11) 



A-CW'C 



a" 2 ('-V'T-'s) aI ' 2 



Next, recall that f2; = U ; r S and that A 1 / 2 has eigenvalue decomposition A = \]Y}I 2 \] T , where 



U= (Uj U 2 ) and S 1 



12 



,1/2 



,1/2 



It can be shown ([32, Theorems 9.1 and 9.2]) that, because ft 1 has full row rank, 



(12) 



Al/2 (^-V 2 ) 2 ^)^ 



j A l/2 


> 


(zf) 25 " 1 


2 

+ 








2 



2q-l 



l/(2q-l) 
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Equations ( fTT] ) and ( [12] ) imply that 
IIa- CW t C r || 



n q-l/2 



+ 



2A l/(2q-l) 



^ F2 2 



2 

2/(2q-l) 



The latter inequality follows from the fact that the 2q — 1 radical function is subadditive when q > 1 

2 " '|2q-l 



and the identity 



= s 



2 2 



. This establishes the stated bound. 



□ 



Remark. The assumption that ft 1 has full row rank is very non-trivial. It is, however, satisfied by our 
algorithms below. See Section 4.1.6| for more details on this point. 

Remark. The proof of Theorem [T] proceeds in two steps. The first step relates low-rank approximation 
of an SPSD matrix A under the SPSD Sketching Model of Section 2.2 to column sketching (e.g., sampling 
or projecting) from the square-root of A. A weaker relation of this type was used in [21 J, but the 
stronger form that we use here in Equation (11) was first proved in [28]. The second step is to 



use a deterministic structural result that holds for sampling/projecting from an arbitrary matrix. The 



structural bound of the form of Equation (12) was originally proven for q = 1 in H14II , where it was 
applied to the Column Subset Selection Problem. The bound was subsequently improved in [32], 
where it was applied to a random projection algorithm and extended to apply when q > 1. Although 
the analyses of our next two results are more complicated, they follow the same high-level two-step 
approach. 

Before proceeding with the analogous Frobenius and trace norm bounds, we pause to describe a 
geometric interpretation of this result. 



4.1.2. A geometric interpretation of the sketching interaction matrix. It is evident from the bound in The- 
orem [l] that the smaller the spectral norm of the sketching interaction matrix fi 2 fij, the more effective 
S is as a sketching matrix. If, additionally, the columns of S are orthonormal, we can give this norm a 
natural geometric interpretation as the tangent of the largest angle between the spaces spanned by S 
and Uj. 

To verify our claim, we first recall the definition of the sine between the range spaces of two matrices 
M x and M 2 : 



sin 2 (M 1 ,M 2 )= ||(I-P Mi )P M2 )| 



2 ' 



Note that this quantity is not symmetric: it measures how well the range of M t captures that of M 2 11291 
Chapter 12]. Since Ux and S (by assumption here) have orthonormal columns, we see that 

sin 2 (S,U 1 ) = ||(I-SS T )U 1 U[||2 

= ||u[(i-ss r )u 1 || 2 
= ||i-u[ss T u 1 || 2 

= 1 - A fc (U[SS T U!) 



1 - 



The second to last equality holds because U[S has k rows and we assumed it has full row rank. Accord- 
ingly, it follows that 

sin 2 (S,U!) 



tan z (S,U!) = 



1 -sin 2 (S,U!) 



n 



1. 
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Now observe that 



n 2 n' 



|(s r u 1 ) 1 's T u 2 uJs(u[s) t 



= (S' 11^(1- S'U a UjSXU{S) T 



The second to last equality holds because of the fact that, for any matrix M, 

Hm^I - MM r )(M r ) t || 2 = ||M' r ||2 - 1; 

this identity can be established with a routine SVD argument. 

Thus, when S has orthonormal columns and U^S has full row-rank, |n 2 n 1 |L is the tangent of the 
largest angle between the range of S and the eigenspace spanned by U 1 . If U^S does not have full 
row-rank, then our derivation above shows that sin 2 (S, Uj) = 1, meaning that there is a vector in the 
eigenspace spanned by U a which has no component in the space spanned by the sketching matrix S. 

We note that tan(S, Uj) also arises in the classical bounds on the convergence of the orthogonal 
iteration algorithm for approximating invariant subspaces of a matrix (see, e.g. [29, Theorem 8.2.2]). 

4.1.3. Frobenius Norm Bounds. Next, we state and prove the following bound on the Frobenius norm 
of the residual error. The proof parallels that for the spectral norm bound, in that we divide it into two 
analogous parts, but the analysis is somewhat more complex. 

The multiplicative eigengap y = A fc+1 (A)/A fc (A) that appears in the statement of this theorem pre- 
dicts the effect of using the power method when constructing sketches. Specifically, the additional 
errors of sketches constructed using C = A q S are at least a factor of times smaller than those 
constructed using C = AS. 

Theorem 2. Let Abe an n x n SPSD matrix with eigenvalue decomposition partitioned as in Equation ([!]), 
S be a sketching matrix of size n x I, qbe a positive integer, f2 x and ft 2 be as defined in Equation ([3]), and 
define 

_ A fc+1 (A) 
Y A fc (A) " 

Then when C = A 9 S and W = S T A 2q ~ 1 S, the corresponding low-rank SPSD approximation satisfies 

,1/2, 



A-CW'C 



, + r 



q-l 



2Tr(S 2 )+ r ' 



q-l 



assuming ft 1 has full row rank. 

Proof. Apply Lemma[l]with the sampling matrix S' = A 9_1 S to see that 



CVJ Jf C T =A 1/2 P A ,-i/2 S A 



1/2 



It follows that 



•cw t c T |L = 

IIF 



A 1/2 (l-P A ,-v 2s )A 1/2 



To bound this quantity, we first use the unitary invariance of the Frobenius norm and the fact that 



= UP 



S9-i/2ur s 



to obtain 

Then we take 
(13) 



E := 



^(I-Pa^A^ 



il/2 



z = s^- 1/2 u r s^s- (q - 1/2) = 



E9-l/2u T S 



1/2 
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where I 



r>kxk 



and F 



t n-kxk is givgn by F _ s ? 1/2 n 2 ri[T. 1 ( ~ q 1/2) . The latter equality in Equa- 



tion ( |13[ ) holds because of our assumption that flj has full row rank. Since the range of Z is contained 
in the range of £ q - 1/2 U T S, 

2 

E <" 



] 1/2 (I-P Z )F 1/2 

By construction, Z has full column rank, thus Z(Z r Z)~ 1/ ' 2 is an orthonormal basis for the span of Z, and 



I-P z = I-Z(Z J Z) _i Z J =1 



-rpVl + F^F)-^! V T ) 



(14) 

This implies that 



E < 



(15) 



il/2 



Sj /2 (I-(I + F T F)- I )si 



I-(I + F r F) _1 -(I + F T F)- 1 F T 
-F(I + F r F) _1 I-F(I + F T F)- 1 F r 



I-(I + F r F) _1 -(I + F r F) _1 F r 
-F(I + F r F) 1 I-F(I + F r F)- 1 F r 



-,1/2 



+ 2 



Sj 72 (I - F(I + F T F)- 1 F T ) S 2 

:=r 1 + r 2 + r 3 . 



F 

ItjTW/2 



sJ /2 (I + F r F)- 1 F T S2 /2 



Next, we provide bounds for T lr T 2 , and T 3 . Using the fact that 0^1 — F(I + F r F) 1 F r ^ I, we can 
bound T 3 with 

r 3 <||s 2 ||p. 

Likewise, the fact that I — (I + F r F) _1 ^ F T F (easily seen with an SVD) implies that we can bound T 1 as 



s i/2 F r FS i/2 



< 



FS 



1/2 



FS 



1/2 



sf 1/2 n 2 ^s~ (9_1) 



sf 1/2 n 2 n{s7 (q - 1) 



-(q-1) 



= c|M 2 |M| 2 )«&- 

A fc (A) J 



i) 



^2^1 



4(q-l) 



2^2^ 2 0-^ 



2^2^ 2 0-^ 



We proceed to bound T 2 by using the estimate 



(16) 



r 2 <2 



sJ^Cl + F'Fr'F 



IttT 



,1/2 



To develop the term involving a spectral norm, observe that for any SPSD matrix M with eigenvalue 
decomposition M = VDV r , 

(I + M) _1 M(I + M)" 1 = (W r + VDV T )" 1 VDV r (W T + VDV 7 ")" 1 
= V(I + D) _1 D(I + D) _1 V r 
-<! VDV r = M. 
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It follows that 



< 



< 



Si (I + F J F) _1 F J F(I + F J F)"^^ 

= fs! /z 2 

2 

(q-D 



s i/2 F r FS i/2 



yj-i/2. 



J 2 ^2^iSi 

2 
2 



-1 



(q-D 



,1/2 
J 2 "2"l 



A fc (A) J 

Using this estimate in Equation ( |16| ), we conclude that 



2(q-l) 



r 2 <2 



A fc (A) J 



,1/2 



Combining our estimates for 7\, T 2 , and T 3 with Equation Q15J ) gives 



A 1 / 2 (l-P A ,- 1/2s )A 



1/2 



< ||s 2 || F + 



A fc+1 (A) yfa-D 

A fc (A) J 



,1/2 



+ 



At+i( A )A 



2(q-l) 



^2^ r^2^^ 



f v A fc (A) ; y 

The claimed bound follows by identifying y and applying the subadditivity of the square-root function: 

A- 1 



E<||s 2 || F + r y 



,1/2, 

J 2 "2"l 



2Tr(S 2 )+ r 



q-l 



,1/2 

J 2 »«2*'l 



□ 



Remark. The quality of approximation guarantee provided by Theorem [2] depends on the quantities 

; these quantities reflect the extent to which the sketching matrix is 



S2^ 2 i^i 



and 



So^ 2 0-j 



aligned with the eigenspaces of A. As we will see in Section 4.2 the degree to which we can bound 
each of these for different sketching procedures is slightly different. The dependence on y captures the 
facts that the power method is effective only when there is spectral decay, and that larger gaps between 
the k and k + 1 eigenvalues lead to smaller errors when the power method is used. 
As before, we pause to describe a geometric interpretation of this result. 

4.1.4. Another geometric interpretation of the sketching interaction matrix. Just as the spectral norm of 
the spectral interaction matrix is the tangent of the largest angle between the range of the sketching 
matrix and the dominant fc-dimensional eigenspace of A, the Frobenius norm of the spectral interaction 
matrix has a geometric interpretation. To see this, recall that the principal angles between the ranges 
of the matrices S and Ui are defined recursively by 



cos(0 ; ) : 



T 

u; v, 



max 

l|u|| 2 =l 



max 

l|v|| 2 =l 



T 

v u; 



ue5?(S) ve^CUj) 

U i ±U 1 ,...,U i _ 1 V i ±V 1 ,...,Vj_ 



satisfy < 1 < . . . < k < n/2; and further, since S and Ui have orthonormal columns, cos(0j) = 
cr ; (UiS) [29, Chapter 12]. Specifically, when S has orthonormal columns and U^S has full row-rank, 
||fi 2 f2||| F is the sum of the squared tangents of the principal angles between the range of the sketch- 
ing matrix and the dominant fc-dimensional eigenspace of A. Thus, this quantity is a more stringent 
measure of how well the sketching matrix captures the dominant eigenspaces of A. 
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The proof of this claim hinges on the fact that, for any matrix M, 

Tr (m + (I - MM T )(M T ) i ) = Tr ((IV^M) 1 ') - rank(M), 
as can be readily verified with an SVD argument. From this observation, we have that 

= Tr ((S r U 1 ) t S r U 2 u£s(U[S) t ) 
= Tr ((S T U 1 ) t (I - S r U 1 U[S)(U[S) t ) 
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n 2 n[ 



-k 



= Tr((n 1 n[) 1 ') -k 

= LL^r'Ks) - 1) = ^(cos-^ej - 1) 

= ^ =i tan 2 (0 ! )- 

Remark. To obtain a greater understanding of the additional error term in Theorem [2| assume that S 



is a particularly effective sketching matrix, so that 



n 2 n[ 



0(1). Then 



= 0(||S 2 || 2 /2 ) and V2Tr(S 2 ) + 



a/2 

J 2 »'2 s 'l 



-°Oi*n. 



so the theorem guarantees that the additional error is on the order of ^/||s 2 || 2 ||e 2 ||*-. This * s an u PP er 
bound on the optimal Frobenius error: 



j 2||f 



< 



vPA || S 2| 



We see, in particular, that if the residual spectrum is flat, i.e. A fc+1 (A) 
holds and the additional error is on the scale of the optimal error. 



A n (A), then equality 



4.1.5. Trace Norm Bounds. Finally, we state and prove the following bound on the trace norm of the 
residual error. The proof method is analogous to that for the spectral and Frobenius norm bounds. 

As in the case of the Frobenius norm error, we see that the multiplicative eigengap y = Aj. +1 (A)/A fc (A) 
predicts the effect of using the power method when constructing sketches: the additional errors of 
sketches constructed using C = A 9 S are a factor of y 2 ^ -1 ) times smaller than the additional errrors of 
those constructed using C = AS. 

Theorem 3. Let Abe an n x n SPSD matrix with eigenvalue decomposition partitioned as in Equation ([l]), 
S be a sketching matrix of size n x I, qbe a positive integer, f2 x and f2 2 be as defined in Equation ([3]), and 
define 

_ Afc+iCA) 
Y A fc (A) • 

Then when C = A q S and W = S T A 2q ~ 1 S, the corresponding low-rank SPSD approximation satisfies 

2 



A-CW T C T <Tr(S 2 ) +r 



2(<2-l) 



2^2^ J^ 2 1^-^ 



assuming fti has full row rank. 

Proof. Since A — CW T C r = A 1 ' 2 (I — P A9 -i/2 S )A 1//2 b 0, its trace norm simplifies to its trace. Thus 

||A - CW^IL = Tr (A - CW t C r ) = Tr (s 1/2 (i - P s? -i/2 S ) S 1/2 ) 
< Tr (S 1/2 (I - P Z )S 1/2 ) , 
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where Z = is defined in Equation ( |13D . The expression for I — P z given in Equation ( |14| ) im- 
plies that 

Tr (S 1/2 (I - P z )£ 1/2 ) = Tr (s} /2 (I - (I + F r F)" 1 )I]J /2 ) + Tr (z 2 /2 (I - F(I + F r F)" ^jS^ 2 ) . 

Recall the estimate I - (I + F T F) _1 ^ F r F and the basic estimate I - F(I + F T F)" 1 F T r< I. Together these 
imply that 

Tr (S 1/2 (I - P Z )S 1/2 ) < Tr (E] /2 F r F£i /2 ) + Tr (S 2 ) 



= Tr(S 2 ) + 
<Tr(S 2 ) + 
= Tr (S 2 ) + T 



Cq-D 



2(q-l) 



-(q-1) 



-.1/2 

J 2 J4 2 J4 1 



2 
F 

^ 2 ^ fi 2 Oj 



The first equality follows from substituting the definition of F and identifying the squared Frobenius 
norm. The last equality follows from identifying j. We have established the claimed bound. 

□ 

Remark. Since the identity ||X|| 2 = ||xx r ||^ holds for any matrix X, the squared Frobenius norm term 
present in the deterministic error bound for the trace norm error is on the scale of II S 2 |L when II n 2 fi| IL 
is 0(1). 

4.1.6. Additional Remarks on Our Deterministic Structural Results. Before applying these deterministic 
structural results in particular randomized algorithmic settings, we pause to make several additional 
remarks about these three theorems. 

First, for some randomized sampling schemes, it may be difficult to obtain a sharp bound on ||S7 2 S7| ||^ 
for E, = 2, F. In these situations, the bounds on the excess error supplied by Theorems [I] [2] and [3] may 



be quite pessimistic. On the other hand, since A — CW'C r = A 1 ' 2 (I — P 



(AV2) 2 



i s )A 1/ , it follows that 



^ A — CW'C ^ A. This implies that the errors of any approximation generated used the SPSD 
Sketching Model, deterministic or randomized, satisfy at least the crude bound ||a — CW^C r ||^ < HAH^. 

Second, we emphasize that these theorems are deterministic structural results that bound the ad- 
ditional error (beyond that of the optimal rank-fc approximation) of low-rank approximations which 
follow our SPSD sketching model. That is, there is no randomness in their statement or analysis. In 
particular, these bounds hold for deterministic as well as randomized sketching matrices S. In the latter 
case, the randomness enters only through S, and one needs to show that the condition that fti has full 
row rank is satisfied with high probability; conditioned on this, the quality of the bound is determined 
by terms that depend on how the sketching matrix interacts with the subspace structure of the matrix A. 

In particular, we remind the reader that (although it is beyond the scope of this paper to explore this 
point in detail) these deterministic structural results could be used to check, in an a posteriori manner, 
the quality of a sketching method for which one cannot establish an a priori bound. 

Third, we also emphasize that the assumption that ft 1 has full row rank (equivalently that tan(S, Uj) < 
oo) is very non-trivial; and that it is false, in worst-case at least and for non-trivial parameter values, 
for common sketching methods such as uniform sampling. To see that some version of leverage-based 
sampling is needed to ensure this condition, recall that U[ Ui = I and thus that f^fi^ = U[SS t U;l can 
be viewed as approximating I with a small number of rank-1 components of U^Uj. The condition that 
fij has full row rank is equivalent to ||u^U 1 — U^SS 7 !^ || 2 < 1. Work on approximating the product of 
matrices by random sampling shows that to obtain non-trivial bounds one must sample with respect to 
the norm of the rank-1 components H19II , which here (since we are approximating the product of two 
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orthogonal matrices) equal the statistical leverage scores. From this perspective, random projections 
satisfy this condition since (informally) they rotate to a random basis where the leverage scores of the 
rotated matrix are approximately uniform and thus where uniform sampling is appropriate 11231 14611 . 

Finally, as observed recently [01, methods that use knowledge of a matrix square root $ (i.e., a <E> 
such that A = <E><E> r ) typically lead to f2(n 2 ) complexity. An important feature of our approach is that 
we only use the matrix square root implicitly — that is, inside the analysis, and not in the statement of 
the algorithm — and thus we do not incur any such cost. 

4.2. Stochastic Error Bounds for Low-rank SPSD Approximation. In this section, we apply the three 



theorems from Section 4.1 to bound the reconstruction errors for several random sampling and random 
projection methods that conform to our SPSD Sketching Model. In particular, we consider two variants 
of random sampling and two variants of random projections: sampling columns according to an impor- 



tance sampling distribution that depends on the statistical leverage scores (in Section 4.2.1| ); randomly 



projecting by using subsampled randomized Fourier transformations (in Section 4.2.2); randomly pro 



jecting by uniformly sampling from Gaussian mixtures of the columns (in Section 4.2.3); and, finally, 



sampling columns uniformly at random (in Section 4.2.4) 



The results are presented for the general case of SPSD sketches constructed using the power method, 
i.e., sketches constructed using C = A 9 S for a positive integer q > 1. The additive errors of these 
sketches decrease proportionally to the number of iterations q, where the constant of proportionality 
is given by the multiplicative eigengap y = A fc+1 (A)/A fc (A). Accordingly, the bounds involve the terms 
y q_1 and f 2 W-l). The bounds simplify considerably when q = 1 (i.e., when there are no additional 
iterations) or 7 = 1 (i.e., when there is no eigengap). In either of these cases, the terms and 
y2(q-i) a u become the constant 1. 

Before establishing these results, we pause here to provide a brief review of running time issues, 
some of which were addressed empirically in Section [3j The computational bottleneck for random 
sampling algorithms (except for uniform sampling that we address in Section 4.2.4| which is trivial 



to implement) is often the exact or approximate computation of the importance sampling distribution 
with respect to which one samples; and the computational bottleneck for random projection methods 
is often the implementation of the random projection. For example, if the sketching matrix S is a 
random projection constructed as an n x I matrix of i.i.d. Gaussian random variables, as we use in 



Section 4.2.3 then the running time of dense data in RAM is not substantially faster than computing 
Vi, while the running time can be much faster for certain sparse matrices or for computation in parallel 
or distributed enviro nments. Alternately, if the sketching matrix S is a Fourier-based projection, as we 



use in Section 4.2.2 then the running time for data stored in RAM is typically 0(n 2 In k), as opposed to 
the 0(n 2 fc) time that would be needed to compute U 1 . These running times depend sensitively on the 
size of the data and the model of data access; see [1461 13 2 II for detailed discussions of these issues. 

In particular, for random sampling algorithms that use a leverage-based importance sampling dis- 
tribution, as we use in Section 4.2.1| it is often said that the running time is no faster than that of 
computing Uj. (This 0(n 2 fc) running time claim is simply the running time of the naive algorithm that 
computes U! "exactly," e.g., with a variant of the QR decomposition, and then reads off the Euclidean 
norms of the rows.) However, the randomized algorithm of II20II that computes relative-error approxi- 
mations to all of the statistical leverage in a time that is qualitatively faster — in worst-case theory and, 
by using existing high-quality randomized numerical code U3l 1501 132H , in practice — gets around this 
bottleneck, as was shown in Section [3} The computational bottleneck for the algorithms of 1 12 Oil is that 
of applying a random projection, and thus the running time for leverage-based Nystrom extension is 
that of applying a ("fast" Fourier-based or "slow" Gaussian-based, as appropriate) random projection to 
A |2D1. See Section [3] or Q3l [50l [32D for additional details. 

4.2.1. Sampling with Leverage-based Importance Sampling Probabilities. Here, the columns of A are 
sampled with replacement according to a nonuniform probability distribution determined by the (exact 
or approximate) statistical leverage scores of A relative to the best rank-fc approximation to A, which in 



50 



ALEX GITTENS AND MICHAEL W. MAHONEY 



turn depend on nonuniformity properties of the top fc-dimensional eigenspace of A. To add flexibility 
(e.g., in case the scores are computed only approximately with the fast algorithm of [20]), we formu- 
late the following lemma in terms of any probability distribution that is /3 -close to the leverage score 
distribution. In particular, consider any probability distribution satisfying 

Pj^^W^iWl and 2j=i*V = 1 ' 

where /3 e (0,1]. Given these (/3 -approximate) leverage-based probabilities, the sketching matrix is 
S = RD where R e R nx( is a column selection matrix that samples columns of A from the given 
distribution — i.e., R i; = 1 iff the tth column of A is the jth column selected — and D is a diagonal 
rescaling matrix satisfying D ;) = -j= iff Ry = 1. For this case, we can prove the following. 



flPi 



Lemma 2. Let A be an n x n SPSD matrix, qbe a positive integer, and S be a sampling matrix of size nx< 
corresponding to a leverage-based probability distribution derived from the top k-dimensional eigenspace 
of A, satisfying 

for some e (0, 1]. Fix a failure probability 5 e (0, 1] and approximation factor e e (0, 1], and let 



^fc+i(A) 
A fc (A) • 



If I > 3200(/3e 2 )~ 1 fcln(4fc/(/3<5)), then, when C = A q S and W = S r A 2q_1 S, the corresponding low- 
rank SPSD approximation satisfies 

(17) II. ™.xt„Tll ^11. .11 , f_2\\,* . ,2 -lll ^V(2q-1) 

(18) 
(19) 



CW J( C T \\ 2 < \\A - A fe || 2 + (e 2 ||(A - A,) 2 "" 1 ||J ' 



A-CW'C J 



If — 



< A-A 



A-CW'C' I < (1 + y q ~ e ) A — A 



k\\ ¥ ' 

2(q-l)_2^ 



2er ?-i + e 2 r 2 (q " 1) )||A-A, 



and 



simultaneously with probability at least 1 — 65 — 0.6. 

Proof. In [45 , proof of Proposition 22] it is shown that if £ satisfies the given bound and the samples are 
drawn from an approximate subspace probability distribution, then for any SPSD diagonal matrix D, 



< e D 



with probability at least 1 — 25 — 0.2. Thus, the estimates 



< € 



and 



^1/2 



2/(2(3-1) 



= e v / Tr(S 2 ) = e v /||A-A fc || it , 

<-i\ 



n 2/(2q-l) 

V 



< 



,q-l/2 



, 2 -x l/(2q-l) 

If J 

l/(2q-l) 



= ^|(A-A l ^-'||,) ,/(2 «- 1 ' 



each hold, individually, with probability at least 1 — 25 — 0.2. In particular, taking q — 1, we see that 
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with the same probability. 

These three estimates used in Theorems [TJ [2] and [3] yield the bounds given in the statement of 
the theorem. □ 

Remark. The additive scale factors for the spectral and Frobenius norm bounds are much improved 
relative to the prior results of II21II . At root, this is since the leverage score importance sampling 
probabilities highlight structural properties of the data (e.g., how to satisfy the condition in Theorems]!] 
[2j and [3] that has full row rank) in a more refined way than the importance sampling probabilities 

of eh. 

Remark. These improvements come at additional computational expense, but we remind the reader 
that leverage-based sampling probabilities of the form used by Lemma [2] can be computed faster than 
the time needed to compute the basis II20I1 . The computational bottleneck of the algorithm of [20] 
is the time required to perform a random projection on the input matrix. 

Remark. Not surprisingly, constant factors such as 3200 (as well as other similarly large factors below) 
and a failure probability bounded away from zero are artifacts of the analysis; the empirical behavior 
of this sampling method is much better. This has been observed previously [|22l 14811 . 

4.2.2. Random Projections with Subsampled Randomized Fourier Transforms. Here, the columns of A are 
randomly mixed using a unitary matrix before the columns are sampled. In particular, S = y^jDTR, 
where D is a diagonal matrix of Rademacher random variables, T is a highly incoherent unitary matrix, 
and R restricts to I columns. For concreteness, and because it has an associated fast transform, we 
consider the case where T is the normalized Fourier transform of size n x n. For this case, we can prove 
the following. 



Lemma 3. Let Abe an n x n SPSD matrix, qbe a positive integer, and S = ^ |DFR be a sampling matrix 
of size n x I, where I) is a diagonal matrix of Rademacher random variables, F is a normalized Fourier 
matrix of size nxn, and R restricts to I columns. Fix a failure probability 5 e (0, 1), approximation factor 
e e (0, 1), and assume that k > 4. Define 

_ A fc+ i(A) 
Y A fc (A) ' 

lft> 24e~ 1 [Vk + ^/81n(8n/5)] 2 ln(8fc/<5), then, when C = A^S and W = S T A 2 « _1 S, the correspond- 
ing low-rank SPSD approximation satisfies 

l/(2q-l)- 



A-CW'C J 



< 

|2 



5 + 



161n(n/5r 



A-A t 



(20) 



+ 



21n(n/<5) A 



l/(2q-l) 



< 



(1- 

A-A fc 



kA-A^- 1 !! 1 ^- 13 , 



+ {7r q ~ l ^ + 22 r 2q - 2 e) ||A-A* 

|a-aJI 



and 



A-CW'C 

1 1 1 1 r 1 1 "|| 

||A-CW i 'C r |L <(l + 22er 2(g - 1) )| 
simultaneously with probability at least 1 — 25. 

Proof. In | |131 proof of Theorem 4], it is shown that for this choice of S and number of samples 

1 




V81n(n/<5) 



+ v / 8ln(n/5)||s 2 
21n(n/5) ^ 
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and 



"^2^ ^2^{ 



< V22t 



nl/2 



= V 22e NL 



each hold, individually, with probability at least 1 — 5. These estimates used in Theorems [T] and[3]yield 
the stated bounds for the spectral and trace norm errors. 

The Frobenius norm bound follows from the same estimates and a simplification of the bound stated 
in Theorem |2j 



Ia-cw^I 



< F2 p + r' 



< s 



2 || F + r' 



q-1 
q-1 



,1/2 



2Tr(S 2 )+ r ^ 1 
V2Tr(S 2 )+ r 2 ^ 



< ||S 2 || F + (r q -'V44^ + 22 r 2q - 2 e) ||s 2 || + . 



We note that a direct application of Theorem [2] gives a potentially tighter, but more unwieldy, bound. 

□ 



Remark. Suppressing the dependence on 5 and e, the spectral norm bound ensures that when p — 1, 
k = n(lnn) and I = fi(fclnfc), then 

||A-CW^|| 2 = o(!^||A-A t || 2 + n L||A-A i | 

This should be compared to the guarantee established in Lemma [4] below for Gaussian-based SPSD 
sketches constructed using the same number of measurements: 



A-CW T C' 



O A -A* 



+ 



kink 



A -A, 



Lemma [3] guarantees that errors on this order can be achieved if one increases the number of samples 
by a logarithm factor in the dimension: specifically, such a bound is achieved when k = £l(ln n ) an d 
I = f2(fclnfclnn). The difference between the number of samples necessary for Fourier-based sketches 
and Gaussian-based sketches is reflective of the differing natures of the random projections: the geome- 
try of any fc-dimensional subspace is preserved under projection onto the span of I = O(fc) Gaussian ran- 
dom vectors [32], but the sharpest analysis available suggests that to preserve the geometry of such a 
subspace under projection onto the span of I SRFT vectors, I must satisfy I = f2(max{fc, lnn}lnfc) [60]. 
We note, however, that in practice the Fourier-based and Gaussian-based SPSD sketches have similar 
reconstruction errors. 

Remark. The structure of the Frobenius and trace norm bounds for the Fourier-based projection are 
identical to the structure of the corresponding bounds from Lemma[2]for leverage-based sampling (and 
the bounds could be made identical with appropriate choice of parameters). This is not surprising since 
(informally) Fourier-based (and other) random projections rotate to a random basis where the leverage 
scores are approximately uniform and thus where uniform sampling is appropriate [|46|| . The disparity 
of the spectral norm bounds suggests that leverage-based SPSD sketches should be expected to be more 



accurate in the spectral norm than Fourier-based sketches; the empirical results of Section 3.4 support 
this interpretation. The running times of the Fourier-based and the leverage-based algorithms are the 
same, to leading order, if the algorithm of [20] (which uses the same transform S = y^DHR) is used 
to approximate the leverage scores. 



4.2.3. Random Projections with Ltd. Gaussian Random Matrices. Here, the columns of A are randomly 
mixed using Gaussian random variables before sampling. Thus, the entries of the sampling matrix 



are i.i.d. standard Gaussian random variables. 
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Lemma 4. Let Abe an n x n SPSD matrix, qbe a positive integer, S e M" x ^ be a matrix of Ltd standard 
Gaussians, and define 



*fc+i(A) 
A fc (A) ' 



/f£ > 2e 2 klnk where e e (0, l)andk > 4, then, when C = A 9 S and W = S T A 2q 1 S, the corresponding 
low-rank SPSD approximation satisfies 



|A-C*C|,< l 1+ ^„_ + 8 74 T J 



A- A 



*||2 



.2 \ 1/(25-1) 



+ 219 



fclnfc 



A- A 



* *' 



A-CW" r C r |L < IIA-AJL + 



k 1 1 p 



42 14 
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Vfc vliTfc 
45 140 219 
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^/||A-A fc || 2 ||A-A fc |L 



v/fclnfc v'fclnfc 
140 437 



r 2c,-2 e 2[ • + '-"'"l Ha-A 



lnfc 



Vklnk k J 

* *' 



k \\2 



and 



( r 2q ~ 2 e 2 \ „ r 2q ~ 2 € 2 
A-CW 1 'C r < l + 45- ! - T -^ — I A-At ., +437^ — : IIA-A. 



k \\2 



simultaneously with probability at least 1 — 2k 1 — 4fc k ^ 2 . 

Proof As before, this result is established by bounding the quantities involved in Theorems [l] [2] and|3j 
The following deviation bounds, established in 11321 Section 10], are useful in that regard: if D is a 
diagonal matrix, I = k + p with p > 4 and u,t > 1, then 



> IIDII 



t H — • tu + ||D|| F — — 7 -t}< 2t~ p + e~" 12 , and 



p+1 p+1 



p + 1 



(21) 



> i|d|If w i , 

F \ p + 1 



3k e^fl n 

t + l|D|| 2 — — • tu > < 2t" p + e 



p + 1 



Write £ = fc + p. Since £ > 2e 2 fclnfc, we have that p > e 2 fclnfc. Accordingly, the following esti- 
mates hold: 
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Use these estimates and take t = e and u = V21nfc in Q21D to obtain that 



yj-i/2 



< 



< 2e z e 
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with the same probability. Likewise, 



2^2^ f2 2 0| 



3 
rnfc 
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< | ee W — l|S 2 ||, + -^= e-||S 2 || 2 

6 2 2 II II 8e4 2 II II 
lnfc 11 2|k fc 11 2,12 



1/2 



with the same probability. 

These estimates used in Theorems[T]and[3]yield the stated spectral and trace norm bounds. To obtain 
the corresponding Frobenius norm bound, define the quantities 



Gi = 



( 12e 2 16e 4 ' 



+ ■ 



^ lnfc ' fc 

^2 



Go = 3e^ 



G 2 = 4e 4 



fclnfc 



lnfc 

-2 



G 4 = 4e 4 



By Theorem j^j and our estimates for 
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The following estimates hold for the coefficients in this inequality: 
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The Frobenius norm bound follows from using these estimates in Equation ( |22D and grouping terms 
appropriately: 
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□ 



Remark. The way we have parameterized these bounds for Gaussian-based projections makes explicit 
the dependence on various parameters, but hides the structural simplicity of these bounds. In particular, 
note that the Frobenius norm bound is upper bounded by a term that depends on the Frobenius norm 
of the error and a term that depends on the trace norm of the error; and that, similarly, the trace norm 
bound is upper bounded by a multiplicative factor that can be set to 1 + e with an appropriate choice 
of parameters. 

4.2.4. Sampling Columns Uniformly at Random. Here, the columns of A are sampled uniformly at ran- 
dom (with or without replacement). Such uniformly-at-random column sampling only makes sense 
when the leverage scores of the top fc-dimensional invariant subspace of the matrix are sufficiently 
uniform that no column is significantly more informative than the others. For this case, we can prove 
the following. 

Lemma 5. Let Abe an n x n SPSD matrix, qbe a positive integer, and S be a sampling matrix of size ux< 
corresponding to sampling the columns of A uniformly at random (with or without replacement). Let [i 
denote the coherence of the top k-dimensional eigenspace of A and fix a failure probability 5 G (0, 1) and 
accuracy factor e e (0, 1). Define 



A fc +i(A) 
A fc (A) • 



If I > 2/ie" 2 fcln(fc/5), then, when C = A q S and W = S r A 2g_1 S, the corresponding low-rank SPSD 
approximation satisfies 
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simultaneously with probability at least 1 — 35. 
Proof. In [ 28 ] , it is shown that 

t 2 " 

n < 

1 2 - (1 - e)i 

with probability at least 1 — 5 when I satisfies the stated bound. Observe that II^JL < I U 2 L ||S|| 2 < 1, 
so that 



< 



with probability at least 1 — 5. Also, 
(23) 



,q-l/2 
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2 < Us II 29 " 1 " 



(1 - e)l 



(1 - e)i 



with at least the same probability. Observe that since S selects I columns uniformly at random, 



E 



S 2 /2lJ 2 S 



i=l 



EX; 



1/2 T 

where the summands X; are distributed uniformly at random over the columns of £ 2 U 2 . Regardless 
of whether S selects the columns with replacement or without replacement, the summands all have the 
same expectation: 



E||xJ| 2 = -Y||(E 2 /2 U 2 W = - 



;=i 



Consequently, 

so by Jensen's inequality 
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Now applying Markov's inequality to Q23[ ), we see that 

1 



< 



f _ 5 V (1 - e) 
with probability at least 1 — 25. Thus, we also know that 



S 2 



^2^ ^2^1 



< 



f (l-e)5" 



also with probability at least 1 — 25. These estimates used in Theorems[T]and[3]yield the stated spectral 
and trace norm bounds. 

To obtain the Frobenius norm bound, observe that Theorem [2] implies 
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Now substitute our estimate for 



S 2 ^ 2 J^^ 



to obtain the stated Frobenius norm bound. 



□ 
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Remark. As with previous bounds for uniform sampling, e.g., 11391 12811 . these results for uniform sam- 
pling are much weaker than our bounds from the previous subsections, since the sampling complexity 
depends on the coherence of the input matrix. When the matrix has small coherence, however, these 
bounds are similar to the bounds derived from the leverage-based sampling probabilities. Recall that, 
by the algorithm of [20], the coherence of an arbitrary input matrix can be computed in roughly the 
time it takes to perform a random projection on the input matrix. 



5. Discussion and Conclusion 

We have presented a unified approach to a large class of low-rank approximations of Laplacian and 
kernel matrices that arise in machine learning and data analysis applications; and in doing so we have 
provided qualitatively-improved worst-case theory and clarified the performance of these algorithms 
in practical settings. Our theoretical and empirical results suggest several obvious directions for fu- 
ture work. 

In general, our empirical evaluation demonstrates that, to obtain moderately high-quality low-rank 
approximations, as measured by minimizing the reconstruction error, depends in complicated ways 
on the spectral decay, the leverage score structure, the eigenvalue gaps in relevant parts of the spec- 
trum, etc. (Ironically, our empirical evaluation also demonstrates that all the sketches considered are 
reasonably-effective at approximating both sparse and dense, and both low-rank and high-rank matri- 
ces which arise in practice. That is, with only roughly 0(k) measurements, the spectral, Frobenius, and 
trace approximation errors stay within a small multiplicative factor of around 3 of the optimal rank-fc 
approximation errors. The reason for this is that matrices for which uniform sampling is least appropri- 
ate tend to be those which are least well-approximated by low-rank matrices, meaning that the residual 
error is much larger.) Thus, e.g., depending on whether one is interested in I being slightly larger or 
much larger than k, leverage-based sampling or a random projection might be most appropriate; and, 
more generally, an ensemble-based method that draws complementary strengths from each of these 
methods might be best. 

In addition, we should note that, in situations where one is concerned with the quality of approxi- 
mation of the actual eigenspaces, one desires both a small spectral norm error (because by the Davis- 
Kahan sin© theorem and similar perturbation results, this would imply that the range space of the 
sketch effectively captures the top fc-dimensional eigenspace of A) as well as to use as few samples as 
possible (because one prefers to approximate the top fc-dimension eigenspace of A with as close to a 
k-dimensional subspace as possible). Our results suggest that the leverage score probabilities supply 
the best sampling scheme for balancing these two competing objectives. 

More generally, although our empirical evaluation consists of random sampling and random projec- 
tion algorithms, our theoretical analysis clearly decouples the randomness in the algorithm from the 
structural heterogenities in the Euclidean vector space that are responsible for the poor performance 
of uniform sampling algorithms. Thus, if those structural conditions can be satisfied with a determin- 
istic algorithm, an iterative algorithm, or any other method, then one can certify (after running the 
algorithm) that good approximation guarantees hold for particular input matrices in less time than is 
required for general matrices. Moreover, this structural decomposition suggests greedy heuristics — 
e.g., greedily keep some number of columns according to approximate statistical leverage scores and 
"residualize." In our experience, a procedure of this form often performs quite well in practice, al- 
though theoretical guarantees tend to be much weaker; and thus we expect that, when coupled with 
our results, such procedures will perform quite well in practice in many medium-scale and large-scale 
machine learning applications. 
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